In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.cluster import KMeans, AgglomerativeClustering
In [ ]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
In [ ]:
# show every column when a DataFrame is displayed
pd.set_option('display.max_columns', None)
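The two `pd.set_option` calls above change display settings for the whole session. If that is broader than needed, `pd.option_context` applies the same options only inside a `with` block. A minimal sketch, assuming a made-up frame `toy`:

```python
import pandas as pd

# Remember the current setting so we can confirm it is restored afterwards.
before = pd.get_option('display.max_columns')

# Made-up frame used only to have something to print.
toy = pd.DataFrame({'a': [1.23456], 'b': [2.0]})

# Inside the block: show all columns and format floats to three decimals.
with pd.option_context('display.max_columns', None,
                       'display.float_format', '{:.3f}'.format):
    inside = pd.get_option('display.max_columns')
    print(toy)

# Outside the block the previous setting is back in effect.
after = pd.get_option('display.max_columns')
```

This keeps later cells unaffected, at the cost of wrapping each display call.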

Cleaning

Setup and merging

In [ ]:
rank = pd.read_csv('colleges.csv')
rank.head()
Out[ ]:
   Unnamed: 0                           College Name  Tuition  Enrollment Numbers
0           0                   Princeton University    56010                4773
1           1                    Columbia University    63530                6170
2           2                     Harvard University    55587                5222
3           3  Massachusetts Institute of Technology    55878                4361
4           4                        Yale University    59950                4703
In [ ]:
rank.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          392 non-null    int64 
 1   College Name        392 non-null    object
 2   Tuition             392 non-null    int64 
 3   Enrollment Numbers  392 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 12.4+ KB
In [ ]:
rank.rename(columns={'Unnamed: 0':'Rank','College Name':'Name'}, inplace=True)
In [ ]:
college_data = pd.read_csv('Data-Table 1.csv')
college_data.head()
Out[ ]:
Name Applicants total Admissions total Enrolled total Percent of freshmen submitting SAT scores Percent of freshmen submitting ACT scores SAT Critical Reading 25th percentile score SAT Critical Reading 75th percentile score SAT Math 25th percentile score SAT Math 75th percentile score SAT Writing 25th percentile score SAT Writing 75th percentile score ACT Composite 25th percentile score ACT Composite 75th percentile score State abbreviation Geographic region Control of institution Historically Black College or University Degree of urbanization (Urban centric locale) Carnegie Classification 2010: Basic Total enrollment Full time enrollment Part time enrollment Undergraduate enrollment Graduate enrollment Full time undergraduate enrollment Part time undergraduate enrollment Percent of total enrollment that are Asian Percent of total enrollment that are Black or African American Percent of total enrollment that are Hispanic/Latino Percent of total enrollment that are Native Hawaiian or Other Pacific Islander Percent of total enrollment that are White Percent of total enrollment that are two or more races Percent of total enrollment that are Nonresident Alien Percent of total enrollment that are women Percent of undergraduate enrollment that are American Indian or Alaska Native Number of first time undergraduates in state Number of first time undergraduates out of state Number of first time undergraduates foreign countries Number of first time undergraduates residence unknown Graduation rate Bachelor degree within 4 years, total Graduation rate Bachelor degree within 5 years, total Graduation rate Bachelor degree within 6 years, total Percent of freshmen receiving any financial aid Percent of freshmen receiving federal grant aid Percent of freshmen receiving Pell grants Percent of freshmen receiving institutional grant aid Percent of freshmen receiving student loan aid Endowment assets
0 Alabama A & M University 6142.000 5521.000 1104.000 15.000 88.000 370.000 450.000 350.000 450.000 NaN NaN 15.000 19.000 Alabama Southeast AL AR FL GA KY LA MS NC SC TN VA WV Public Yes City: Midsize Master's Colleges and Universities (larger pro... 5020.000 4439.000 581.000 4051.000 969.000 3799.000 252.000 1.000 92.000 1.000 0.000 5.000 0.000 0.000 55.000 0.000 NaN NaN NaN NaN 10.000 23.000 29.000 97.000 81.000 81.000 32.000 89.000 0
1 University of Alabama at Birmingham 5689.000 4934.000 1773.000 6.000 93.000 520.000 640.000 520.000 650.000 NaN NaN 22.000 28.000 Alabama Southeast AL AR FL GA KY LA MS NC SC TN VA WV Public No City: Midsize Research Universities (very high research acti... 18568.000 11961.000 6607.000 11502.000 7066.000 8357.000 3145.000 5.000 21.000 3.000 0.000 64.000 3.000 3.000 61.000 0.000 1529.000 224.000 19.000 1.000 29.000 46.000 53.000 90.000 36.000 36.000 60.000 56.000 24136
2 Amridge University NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Alabama Southeast AL AR FL GA KY LA MS NC SC TN VA WV Private not for profit No City: Midsize Baccalaureate Colleges Arts & Sciences 631.000 323.000 308.000 322.000 309.000 202.000 120.000 0.000 40.000 1.000 0.000 30.000 0.000 0.000 58.000 0.000 NaN NaN NaN NaN 0.000 0.000 67.000 100.000 90.000 90.000 90.000 100.000 302
3 University of Alabama at Huntsville 2054.000 1656.000 651.000 34.000 94.000 510.000 640.000 510.000 650.000 NaN NaN 23.000 29.000 Alabama Southeast AL AR FL GA KY LA MS NC SC TN VA WV Public No City: Midsize Research Universities (very high research acti... 7376.000 4802.000 2574.000 5696.000 1680.000 4237.000 1459.000 4.000 12.000 3.000 0.000 69.000 1.000 6.000 44.000 1.000 514.000 92.000 27.000 18.000 16.000 37.000 48.000 87.000 31.000 31.000 63.000 46.000 11502
4 Alabama State University 10245.000 5251.000 1479.000 18.000 87.000 380.000 480.000 370.000 480.000 NaN NaN 15.000 19.000 Alabama Southeast AL AR FL GA KY LA MS NC SC TN VA WV Public Yes City: Midsize Master's Colleges and Universities (larger pro... 6075.000 5182.000 893.000 5356.000 719.000 4872.000 484.000 0.000 91.000 1.000 0.000 3.000 1.000 2.000 61.000 0.000 903.000 571.000 67.000 4.000 9.000 19.000 25.000 93.000 76.000 76.000 34.000 81.000 13202
In [ ]:
# count missing values in each column
college_data.isnull().sum()
Out[ ]:
Name                                                                                0
Applicants total                                                                  157
Admissions total                                                                  157
Enrolled total                                                                    157
Percent of freshmen submitting SAT scores                                         277
Percent of freshmen submitting ACT scores                                         275
SAT Critical Reading 25th percentile score                                        365
SAT Critical Reading 75th percentile score                                        365
SAT Math 25th percentile score                                                    352
SAT Math 75th percentile score                                                    352
SAT Writing 25th percentile score                                                 820
SAT Writing 75th percentile score                                                 820
ACT Composite 25th percentile score                                               335
ACT Composite 75th percentile score                                               335
State abbreviation                                                                  0
Geographic region                                                                   0
Control of institution                                                              0
Historically Black College or University                                            0
Degree of urbanization (Urban centric locale)                                       0
Carnegie Classification 2010: Basic                                                 0
Total enrollment                                                                    2
Full time enrollment                                                                2
Part time enrollment                                                                2
Undergraduate enrollment                                                            2
Graduate enrollment                                                                 2
Full time undergraduate enrollment                                                  2
Part time undergraduate enrollment                                                  2
Percent of total enrollment that are Asian                                          2
Percent of total enrollment that are Black or African American                      2
Percent of total enrollment that are Hispanic/Latino                                2
Percent of total enrollment that are Native Hawaiian or Other Pacific Islander      2
Percent of total enrollment that are White                                          2
Percent of total enrollment that are two or more races                              2
Percent of total enrollment that are Nonresident Alien                              2
Percent of total enrollment that are women                                          2
Percent of undergraduate enrollment that are American Indian or Alaska Native      12
Number of first time undergraduates  in state                                     623
Number of first time undergraduates  out of state                                 623
Number of first time undergraduates  foreign countries                            623
Number of first time undergraduates  residence unknown                            623
Graduation rate  Bachelor degree within 4 years, total                             58
Graduation rate  Bachelor degree within 5 years, total                             58
Graduation rate  Bachelor degree within 6 years, total                             58
Percent of freshmen receiving any financial aid                                    42
Percent of freshmen receiving federal grant aid                                    42
Percent of freshmen receiving Pell grants                                          42
Percent of freshmen receiving institutional grant aid                              42
Percent of freshmen receiving student loan aid                                     42
Endowment assets                                                                    0
dtype: int64
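The raw null counts above are easier to compare as shares of all rows. A minimal sketch on a toy frame (the frame `toy` and its values are made up; only the column names are taken from the data above):

```python
import numpy as np
import pandas as pd

# Toy stand-in for college_data; the values are invented for illustration.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Applicants total': [100.0, np.nan, 300.0, 400.0],
    'SAT Writing 25th percentile score': [np.nan, np.nan, np.nan, 500.0],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of missing rows per column (0.0 to 1.0).
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the sparsest columns first, which helps when deciding which columns to drop or impute.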
In [ ]:
college_data.columns
Out[ ]:
Index(['Name', 'Applicants total', 'Admissions total', 'Enrolled total',
       'Percent of freshmen submitting SAT scores',
       'Percent of freshmen submitting ACT scores',
       'SAT Critical Reading 25th percentile score',
       'SAT Critical Reading 75th percentile score',
       'SAT Math 25th percentile score', 'SAT Math 75th percentile score',
       'SAT Writing 25th percentile score',
       'SAT Writing 75th percentile score',
       'ACT Composite 25th percentile score',
       'ACT Composite 75th percentile score', 'State abbreviation',
       'Geographic region', 'Control of institution',
       'Historically Black College or University',
       'Degree of urbanization (Urban centric locale)',
       'Carnegie Classification 2010: Basic', 'Total enrollment',
       'Full time enrollment', 'Part time enrollment',
       'Undergraduate enrollment', 'Graduate enrollment',
       'Full time undergraduate enrollment',
       'Part time undergraduate enrollment',
       'Percent of total enrollment that are Asian',
       'Percent of total enrollment that are Black or African American',
       'Percent of total enrollment that are Hispanic/Latino',
       'Percent of total enrollment that are Native Hawaiian or Other Pacific Islander',
       'Percent of total enrollment that are White',
       'Percent of total enrollment that are two or more races',
       'Percent of total enrollment that are Nonresident Alien',
       'Percent of total enrollment that are women',
       'Percent of undergraduate enrollment that are American Indian or Alaska Native',
       'Number of first time undergraduates  in state',
       'Number of first time undergraduates  out of state',
       'Number of first time undergraduates  foreign countries',
       'Number of first time undergraduates  residence unknown',
       'Graduation rate  Bachelor degree within 4 years, total',
       'Graduation rate  Bachelor degree within 5 years, total',
       'Graduation rate  Bachelor degree within 6 years, total',
       'Percent of freshmen receiving any financial aid',
       'Percent of freshmen receiving federal grant aid',
       'Percent of freshmen receiving Pell grants',
       'Percent of freshmen receiving institutional grant aid',
       'Percent of freshmen receiving student loan aid', 'Endowment assets'],
      dtype='object')
In [ ]:
# normalize names so they match the ranking file; regex is set explicitly to avoid the FutureWarning
# the character class strips punctuation (commas included, as in the original pattern)
college_data['Name'] = college_data['Name'].str.replace(r'[#,@&+*%$^!~.]', '', regex=True)
college_data['Name'] = college_data['Name'].str.replace('The ', '', regex=False)
college_data['Name'] = college_data['Name'].str.replace(' at ', ' ', regex=False)
college_data['Name'] = college_data['Name'].str.replace('Main Campus', '', regex=False)
college_data['Name'] = college_data['Name'].str.replace(' and ', ' ', regex=False)
# collapse the double spaces left behind by the removals above
college_data['Name'] = college_data['Name'].str.replace('  ', ' ', regex=False)
college_data['Name'] = college_data['Name'].str.strip()
In [ ]:
# apply the same normalization to the ranking names; regex is set explicitly to avoid the FutureWarning
rank['Name'] = rank['Name'].str.replace(r'[#,@&+*%$^!~.]', '', regex=True)
rank['Name'] = rank['Name'].str.replace('The ', '', regex=False)
rank['Name'] = rank['Name'].str.replace(' at ', ' ', regex=False)
rank['Name'] = rank['Name'].str.replace(' and ', ' ', regex=False)
# campus separators such as "--" and "-" become spaces
rank['Name'] = rank['Name'].str.replace('--', ' ', regex=False)
rank['Name'] = rank['Name'].str.replace('-', ' ', regex=False)
rank['Name'] = rank['Name'].str.replace('  ', ' ', regex=False)
rank['Name'] = rank['Name'].str.strip()
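The two cleaning cells above repeat nearly the same steps, so they could be folded into one helper. The sketch below is an assumption, not the notebook's code: `normalize_name` and its `extra_literals` parameter are hypothetical names, and the order of the literal replacements differs slightly from the cells above, which may matter in edge cases.

```python
import pandas as pd

def normalize_name(names: pd.Series, extra_literals=()) -> pd.Series:
    """Hypothetical helper that approximately mirrors the cleaning steps above."""
    # Strip punctuation; commas are included because the original pattern lists them.
    out = names.str.replace(r'[#,@&+*%$^!~.]', '', regex=True)
    # Literal replacements applied in order; the double-space collapse comes last.
    steps = [('The ', ''), (' at ', ' '), (' and ', ' '), *extra_literals, ('  ', ' ')]
    for old, new in steps:
        out = out.str.replace(old, new, regex=False)
    return out.str.strip()

# Example names (illustrative inputs, not rows read from the CSV files).
names = pd.Series([
    'The University of Alabama at Birmingham',
    'Alabama A & M University',
    'Purdue University--West Lafayette',
])
cleaned = normalize_name(names, extra_literals=[('--', ' '), ('-', ' ')])
print(cleaned.tolist())
```

Under these assumptions, `college_data['Name']` would be passed with `extra_literals=[('Main Campus', '')]` and `rank['Name']` with the hyphen replacements shown here.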
In [ ]:
# outer merge on the cleaned college name; rows without a match in either frame are kept with NaNs
data = pd.merge(rank, college_data, on='Name', how='outer')
data.head()
Out[ ]:
Rank Name Tuition Enrollment Numbers Applicants total Admissions total Enrolled total Percent of freshmen submitting SAT scores Percent of freshmen submitting ACT scores SAT Critical Reading 25th percentile score SAT Critical Reading 75th percentile score SAT Math 25th percentile score SAT Math 75th percentile score SAT Writing 25th percentile score SAT Writing 75th percentile score ACT Composite 25th percentile score ACT Composite 75th percentile score State abbreviation Geographic region Control of institution Historically Black College or University Degree of urbanization (Urban centric locale) Carnegie Classification 2010: Basic Total enrollment Full time enrollment Part time enrollment Undergraduate enrollment Graduate enrollment Full time undergraduate enrollment Part time undergraduate enrollment Percent of total enrollment that are Asian Percent of total enrollment that are Black or African American Percent of total enrollment that are Hispanic/Latino Percent of total enrollment that are Native Hawaiian or Other Pacific Islander Percent of total enrollment that are White Percent of total enrollment that are two or more races Percent of total enrollment that are Nonresident Alien Percent of total enrollment that are women Percent of undergraduate enrollment that are American Indian or Alaska Native Number of first time undergraduates in state Number of first time undergraduates out of state Number of first time undergraduates foreign countries Number of first time undergraduates residence unknown Graduation rate Bachelor degree within 4 years, total Graduation rate Bachelor degree within 5 years, total Graduation rate Bachelor degree within 6 years, total Percent of freshmen receiving any financial aid Percent of freshmen receiving federal grant aid Percent of freshmen receiving Pell grants Percent of freshmen receiving institutional grant aid Percent of freshmen receiving student loan aid Endowment assets
0 0.000 Princeton University 56010.000 4773.000 26499.000 1963.000 1285.000 86.000 33.000 700.000 800.000 710.000 800.000 710.000 790.000 31.000 35.000 New Jersey Mid East DE DC MD NJ NY PA Private not for profit No Suburb: Large Research Universities (very high research acti... 8014.000 7935.000 79.000 5323.000 2691.000 5244.000 79.000 15.000 6.000 7.000 0.000 45.000 4.000 20.000 45.000 0.000 197.000 929.000 157.000 1.000 88.000 95.000 97.000 60.000 14.000 14.000 60.000 9.000 2320421.000
1 1.000 Columbia University 63530.000 6170.000 31851.000 2362.000 1415.000 90.000 32.000 690.000 780.000 700.000 790.000 690.000 780.000 31.000 34.000 New York Mid East DE DC MD NJ NY PA Private not for profit No City: Large Research Universities (very high research acti... 26957.000 22731.000 4226.000 7970.000 18987.000 7374.000 596.000 13.000 5.000 8.000 0.000 36.000 3.000 28.000 51.000 1.000 324.000 961.000 224.000 0.000 86.000 92.000 93.000 57.000 15.000 15.000 49.000 16.000 316753.000
2 2.000 Harvard University 55587.000 5222.000 35023.000 2047.000 1659.000 86.000 38.000 700.000 800.000 710.000 800.000 710.000 800.000 32.000 35.000 Massachusetts New England CT ME MA NH RI VT Private not for profit No City: Midsize Research Universities (very high research acti... 28297.000 20370.000 7927.000 10534.000 17763.000 7240.000 3294.000 13.000 5.000 7.000 0.000 45.000 3.000 21.000 49.000 0.000 NaN NaN NaN NaN 87.000 95.000 97.000 75.000 15.000 15.000 58.000 9.000 1392761.000
3 3.000 Massachusetts Institute of Technology 55878.000 4361.000 18989.000 1548.000 1115.000 85.000 40.000 680.000 770.000 750.000 800.000 690.000 780.000 33.000 35.000 Massachusetts New England CT ME MA NH RI VT Private not for profit No City: Midsize Research Universities (very high research acti... 11301.000 11138.000 163.000 4528.000 6773.000 4499.000 29.000 16.000 3.000 9.000 0.000 34.000 3.000 29.000 37.000 0.000 75.000 929.000 110.000 1.000 84.000 91.000 93.000 87.000 18.000 16.000 55.000 19.000 980404.000
4 4.000 Yale University 59950.000 4703.000 28977.000 2043.000 1356.000 84.000 35.000 700.000 800.000 710.000 790.000 710.000 800.000 32.000 35.000 Connecticut New England CT ME MA NH RI VT Private not for profit No City: Midsize Research Universities (very high research acti... 12109.000 11927.000 182.000 5430.000 6679.000 5424.000 6.000 13.000 5.000 7.000 0.000 48.000 4.000 18.000 49.000 1.000 82.000 1113.000 162.000 1.000 90.000 96.000 98.000 61.000 13.000 13.000 50.000 6.000 1528324.000
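To see how many rows matched in the outer merge above without inspecting null columns, `pd.merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only`, and `both`. A minimal sketch on toy frames (`rank_toy`, `data_toy`, and their contents are made up):

```python
import pandas as pd

# Toy stand-ins for the two frames; names and values are invented.
rank_toy = pd.DataFrame({'Name': ['Alpha University', 'Beta College'],
                         'Rank': [0, 1]})
data_toy = pd.DataFrame({'Name': ['Beta College', 'Gamma Institute'],
                         'Applicants total': [1200.0, 800.0]})

# indicator=True records where each merged row came from.
merged = pd.merge(rank_toy, data_toy, on='Name', how='outer', indicator=True)

# Count rows found only on the left, only on the right, or in both frames.
match_counts = merged['_merge'].value_counts()
print(match_counts)
```

Filtering on `merged['_merge'] == 'left_only'` would then list the ranked colleges that found no match.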
In [ ]:
# ranked colleges that have no admissions data after the merge
just_rank = data[data['Rank'].notnull() & data['Applicants total'].isnull()]
just_rank.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 27 entries, 52 to 392
Data columns (total 52 columns):
 #   Column                                                                          Non-Null Count  Dtype  
---  ------                                                                          --------------  -----  
 0   Rank                                                                            27 non-null     float64
 1   Name                                                                            27 non-null     object 
 2   Tuition                                                                         27 non-null     float64
 3   Enrollment Numbers                                                              27 non-null     float64
 4   Applicants total                                                                0 non-null      float64
 5   Admissions total                                                                0 non-null      float64
 6   Enrolled total                                                                  0 non-null      float64
 7   Percent of freshmen submitting SAT scores                                       0 non-null      float64
 8   Percent of freshmen submitting ACT scores                                       0 non-null      float64
 9   SAT Critical Reading 25th percentile score                                      0 non-null      float64
 10  SAT Critical Reading 75th percentile score                                      0 non-null      float64
 11  SAT Math 25th percentile score                                                  0 non-null      float64
 12  SAT Math 75th percentile score                                                  0 non-null      float64
 13  SAT Writing 25th percentile score                                               0 non-null      float64
 14  SAT Writing 75th percentile score                                               0 non-null      float64
 15  ACT Composite 25th percentile score                                             0 non-null      float64
 16  ACT Composite 75th percentile score                                             0 non-null      float64
 17  State abbreviation                                                              8 non-null      object 
 18  Geographic region                                                               8 non-null      object 
 19  Control of institution                                                          8 non-null      object 
 20  Historically Black College or University                                        8 non-null      object 
 21  Degree of urbanization (Urban centric locale)                                   8 non-null      object 
 22  Carnegie Classification 2010: Basic                                             8 non-null      object 
 23  Total enrollment                                                                8 non-null      float64
 24  Full time enrollment                                                            8 non-null      float64
 25  Part time enrollment                                                            8 non-null      float64
 26  Undergraduate enrollment                                                        8 non-null      float64
 27  Graduate enrollment                                                             8 non-null      float64
 28  Full time undergraduate enrollment                                              8 non-null      float64
 29  Part time undergraduate enrollment                                              8 non-null      float64
 30  Percent of total enrollment that are Asian                                      8 non-null      float64
 31  Percent of total enrollment that are Black or African American                  8 non-null      float64
 32  Percent of total enrollment that are Hispanic/Latino                            8 non-null      float64
 33  Percent of total enrollment that are Native Hawaiian or Other Pacific Islander  8 non-null      float64
 34  Percent of total enrollment that are White                                      8 non-null      float64
 35  Percent of total enrollment that are two or more races                          8 non-null      float64
 36  Percent of total enrollment that are Nonresident Alien                          8 non-null      float64
 37  Percent of total enrollment that are women                                      8 non-null      float64
 38  Percent of undergraduate enrollment that are American Indian or Alaska Native   8 non-null      float64
 39  Number of first time undergraduates  in state                                   4 non-null      float64
 40  Number of first time undergraduates  out of state                               4 non-null      float64
 41  Number of first time undergraduates  foreign countries                          4 non-null      float64
 42  Number of first time undergraduates  residence unknown                          4 non-null      float64
 43  Graduation rate  Bachelor degree within 4 years, total                          7 non-null      float64
 44  Graduation rate  Bachelor degree within 5 years, total                          7 non-null      float64
 45  Graduation rate  Bachelor degree within 6 years, total                          7 non-null      float64
 46  Percent of freshmen receiving any financial aid                                 7 non-null      float64
 47  Percent of freshmen receiving federal grant aid                                 7 non-null      float64
 48  Percent of freshmen receiving Pell grants                                       7 non-null      float64
 49  Percent of freshmen receiving institutional grant aid                           7 non-null      float64
 50  Percent of freshmen receiving student loan aid                                  7 non-null      float64
 51  Endowment assets                                                                8 non-null      float64
dtypes: float64(45), object(7)
memory usage: 11.2+ KB
In [ ]:
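# of those 27 rows, keep the ones with no college_data fields at all, i.e. names that failed to match
just_rank = just_rank[just_rank['Endowment assets'].isnull()]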
In [ ]:
just_rank['Name'].unique()
Out[ ]:
array(['Purdue University West Lafayette',
       'Pennsylvania State University University Park',
       'University of California Merced', 'Thomas Jefferson University',
       'Russell Sage College',
       'Inter American University of Puerto Rico San German',
       'Tennessee Techn University', 'Long Island University',
       'University of Puerto Rico Rio Piedras', 'Augusta University',
       'Colorado Technical University', 'Grand Canyon University',
       'Inter American University of Puerto Rico Metropolitan Campus',
       'Keiser University', 'Mary Baldwin University',
       'Pontifical Catholic University of Puerto Rico Ponce',
       'Universidad Ana G Mendez Gurabo Campus', 'University of Phoenix',
       'University of Texas Rio Grande Valley'], dtype=object)
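Before dropping the unmatched names listed above, it can help to check whether a near match exists under a slightly different spelling. The standard library's `difflib.get_close_matches` gives a quick way to do that. In this sketch the `candidates` list is hypothetical; in the notebook it would come from `college_data['Name']`.

```python
import difflib

# Two of the names left unmatched after the merge (taken from the list above).
unmatched = ['Purdue University West Lafayette', 'Tennessee Techn University']

# Hypothetical candidate names; the real ones would come from college_data['Name'].
candidates = ['Purdue University',
              'Tennessee Technological University',
              'Yale University']

# For each unmatched name, keep up to three candidates with similarity >= 0.6,
# ordered from most to least similar.
suggestions = {
    name: difflib.get_close_matches(name, candidates, n=3, cutoff=0.6)
    for name in unmatched
}

for name, close in suggestions.items():
    print(name, '->', close)
```

The suggestions are only candidates for manual review; a high similarity score does not guarantee the two names refer to the same institution.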
In [ ]:
# drop the ranked colleges that found no match in college_data
data.drop(just_rank.index, inplace=True)
In [ ]:
# confirm the drop: this lookup should return an empty frame
data[data['Name'] == 'University of Texas Rio Grande Valley']
Out[ ]:
Rank Name Tuition Enrollment Numbers Applicants total Admissions total Enrolled total Percent of freshmen submitting SAT scores Percent of freshmen submitting ACT scores SAT Critical Reading 25th percentile score SAT Critical Reading 75th percentile score SAT Math 25th percentile score SAT Math 75th percentile score SAT Writing 25th percentile score SAT Writing 75th percentile score ACT Composite 25th percentile score ACT Composite 75th percentile score State abbreviation Geographic region Control of institution Historically Black College or University Degree of urbanization (Urban centric locale) Carnegie Classification 2010: Basic Total enrollment Full time enrollment Part time enrollment Undergraduate enrollment Graduate enrollment Full time undergraduate enrollment Part time undergraduate enrollment Percent of total enrollment that are Asian Percent of total enrollment that are Black or African American Percent of total enrollment that are Hispanic/Latino Percent of total enrollment that are Native Hawaiian or Other Pacific Islander Percent of total enrollment that are White Percent of total enrollment that are two or more races Percent of total enrollment that are Nonresident Alien Percent of total enrollment that are women Percent of undergraduate enrollment that are American Indian or Alaska Native Number of first time undergraduates in state Number of first time undergraduates out of state Number of first time undergraduates foreign countries Number of first time undergraduates residence unknown Graduation rate Bachelor degree within 4 years, total Graduation rate Bachelor degree within 5 years, total Graduation rate Bachelor degree within 6 years, total Percent of freshmen receiving any financial aid Percent of freshmen receiving federal grant aid Percent of freshmen receiving Pell grants Percent of freshmen receiving institutional grant aid Percent of freshmen receiving student loan aid Endowment assets
In [ ]:
# colleges with admissions data but no ranking
just_data = data[data['Rank'].isnull() & data['Applicants total'].notnull()]
just_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1010 entries, 394 to 1551
Data columns (total 52 columns):
 #   Column                                                                          Non-Null Count  Dtype  
---  ------                                                                          --------------  -----  
 0   Rank                                                                            0 non-null      float64
 1   Name                                                                            1010 non-null   object 
 2   Tuition                                                                         0 non-null      float64
 3   Enrollment Numbers                                                              0 non-null      float64
 4   Applicants total                                                                1010 non-null   float64
 5   Admissions total                                                                1010 non-null   float64
 6   Enrolled total                                                                  1010 non-null   float64
 7   Percent of freshmen submitting SAT scores                                       903 non-null    float64
 8   Percent of freshmen submitting ACT scores                                       904 non-null    float64
 9   SAT Critical Reading 25th percentile score                                      832 non-null    float64
 10  SAT Critical Reading 75th percentile score                                      832 non-null    float64
 11  SAT Math 25th percentile score                                                  840 non-null    float64
 12  SAT Math 75th percentile score                                                  840 non-null    float64
 13  SAT Writing 25th percentile score                                               507 non-null    float64
 14  SAT Writing 75th percentile score                                               507 non-null    float64
 15  ACT Composite 25th percentile score                                             858 non-null    float64
 16  ACT Composite 75th percentile score                                             858 non-null    float64
 17  State abbreviation                                                              1010 non-null   object 
 18  Geographic region                                                               1010 non-null   object 
 19  Control of institution                                                          1010 non-null   object 
 20  Historically Black College or University                                        1010 non-null   object 
 21  Degree of urbanization (Urban centric locale)                                   1010 non-null   object 
 22  Carnegie Classification 2010: Basic                                             1010 non-null   object 
 23  Total enrollment                                                                1010 non-null   float64
 24  Full time enrollment                                                            1010 non-null   float64
 25  Part time enrollment                                                            1010 non-null   float64
 26  Undergraduate enrollment                                                        1010 non-null   float64
 27  Graduate enrollment                                                             1010 non-null   float64
 28  Full time undergraduate enrollment                                              1010 non-null   float64
 29  Part time undergraduate enrollment                                              1010 non-null   float64
 30  Percent of total enrollment that are Asian                                      1010 non-null   float64
 31  Percent of total enrollment that are Black or African American                  1010 non-null   float64
 32  Percent of total enrollment that are Hispanic/Latino                            1010 non-null   float64
 33  Percent of total enrollment that are Native Hawaiian or Other Pacific Islander  1010 non-null   float64
 34  Percent of total enrollment that are White                                      1010 non-null   float64
 35  Percent of total enrollment that are two or more races                          1010 non-null   float64
 36  Percent of total enrollment that are Nonresident Alien                          1010 non-null   float64
 37  Percent of total enrollment that are women                                      1010 non-null   float64
 38  Percent of undergraduate enrollment that are American Indian or Alaska Native   1010 non-null   float64
 39  Number of first time undergraduates  in state                                   591 non-null    float64
 40  Number of first time undergraduates  out of state                               591 non-null    float64
 41  Number of first time undergraduates  foreign countries                          591 non-null    float64
 42  Number of first time undergraduates  residence unknown                          591 non-null    float64
 43  Graduation rate  Bachelor degree within 4 years, total                          1002 non-null   float64
 44  Graduation rate  Bachelor degree within 5 years, total                          1002 non-null   float64
 45  Graduation rate  Bachelor degree within 6 years, total                          1002 non-null   float64
 46  Percent of freshmen receiving any financial aid                                 1007 non-null   float64
 47  Percent of freshmen receiving federal grant aid                                 1007 non-null   float64
 48  Percent of freshmen receiving Pell grants                                       1007 non-null   float64
 49  Percent of freshmen receiving institutional grant aid                           1007 non-null   float64
 50  Percent of freshmen receiving student loan aid                                  1007 non-null   float64
 51  Endowment assets                                                                1010 non-null   float64
dtypes: float64(45), object(7)
memory usage: 418.2+ KB
In [ ]:
complete = data[(data['Rank'].notnull())&(data['Applicants total'].notnull())]
complete.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 367 entries, 0 to 393
Data columns (total 52 columns):
 #   Column                                                                          Non-Null Count  Dtype  
---  ------                                                                          --------------  -----  
 0   Rank                                                                            367 non-null    float64
 1   Name                                                                            367 non-null    object 
 2   Tuition                                                                         367 non-null    float64
 3   Enrollment Numbers                                                              367 non-null    float64
 4   Applicants total                                                                367 non-null    float64
 5   Admissions total                                                                367 non-null    float64
 6   Enrolled total                                                                  367 non-null    float64
 7   Percent of freshmen submitting SAT scores                                       354 non-null    float64
 8   Percent of freshmen submitting ACT scores                                       355 non-null    float64
 9   SAT Critical Reading 25th percentile score                                      337 non-null    float64
 10  SAT Critical Reading 75th percentile score                                      337 non-null    float64
 11  SAT Math 25th percentile score                                                  342 non-null    float64
 12  SAT Math 75th percentile score                                                  342 non-null    float64
 13  SAT Writing 25th percentile score                                               207 non-null    float64
 14  SAT Writing 75th percentile score                                               207 non-null    float64
 15  ACT Composite 25th percentile score                                             341 non-null    float64
 16  ACT Composite 75th percentile score                                             341 non-null    float64
 17  State abbreviation                                                              367 non-null    object 
 18  Geographic region                                                               367 non-null    object 
 19  Control of institution                                                          367 non-null    object 
 20  Historically Black College or University                                        367 non-null    object 
 21  Degree of urbanization (Urban centric locale)                                   367 non-null    object 
 22  Carnegie Classification 2010: Basic                                             367 non-null    object 
 23  Total enrollment                                                                367 non-null    float64
 24  Full time enrollment                                                            367 non-null    float64
 25  Part time enrollment                                                            367 non-null    float64
 26  Undergraduate enrollment                                                        367 non-null    float64
 27  Graduate enrollment                                                             367 non-null    float64
 28  Full time undergraduate enrollment                                              367 non-null    float64
 29  Part time undergraduate enrollment                                              367 non-null    float64
 30  Percent of total enrollment that are Asian                                      367 non-null    float64
 31  Percent of total enrollment that are Black or African American                  367 non-null    float64
 32  Percent of total enrollment that are Hispanic/Latino                            367 non-null    float64
 33  Percent of total enrollment that are Native Hawaiian or Other Pacific Islander  367 non-null    float64
 34  Percent of total enrollment that are White                                      367 non-null    float64
 35  Percent of total enrollment that are two or more races                          367 non-null    float64
 36  Percent of total enrollment that are Nonresident Alien                          367 non-null    float64
 37  Percent of total enrollment that are women                                      367 non-null    float64
 38  Percent of undergraduate enrollment that are American Indian or Alaska Native   367 non-null    float64
 39  Number of first time undergraduates  in state                                   268 non-null    float64
 40  Number of first time undergraduates  out of state                               268 non-null    float64
 41  Number of first time undergraduates  foreign countries                          268 non-null    float64
 42  Number of first time undergraduates  residence unknown                          268 non-null    float64
 43  Graduation rate  Bachelor degree within 4 years, total                          365 non-null    float64
 44  Graduation rate  Bachelor degree within 5 years, total                          365 non-null    float64
 45  Graduation rate  Bachelor degree within 6 years, total                          365 non-null    float64
 46  Percent of freshmen receiving any financial aid                                 366 non-null    float64
 47  Percent of freshmen receiving federal grant aid                                 366 non-null    float64
 48  Percent of freshmen receiving Pell grants                                       366 non-null    float64
 49  Percent of freshmen receiving institutional grant aid                           366 non-null    float64
 50  Percent of freshmen receiving student loan aid                                  366 non-null    float64
 51  Endowment assets                                                                367 non-null    float64
dtypes: float64(45), object(7)
memory usage: 152.0+ KB
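The filter above keeps only rows where both `Rank` and `Applicants total` are present. As a minimal sketch on an invented toy frame (the values are illustrative, not taken from the dataset), the same selection can be written with `dropna(subset=...)`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged frame; all values are invented for illustration.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Rank': [0.0, np.nan, 2.0, 3.0],
    'Applicants total': [100.0, 200.0, np.nan, 400.0],
})

# Boolean-mask form, as used in the cell above.
mask_result = toy[toy['Rank'].notnull() & toy['Applicants total'].notnull()]

# Equivalent form: drop rows that are missing either column.
dropna_result = toy.dropna(subset=['Rank', 'Applicants total'])

same_rows = mask_result.equals(dropna_result)
kept_names = dropna_result['Name'].tolist()
```

Both forms leave the original index labels in place, which is why the `info()` output above shows 367 entries with labels running from 0 to 393.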
In [ ]:
just_rank.to_csv('just_rank.csv')

Prep for modeling

In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1534 entries, 0 to 1552
Data columns (total 52 columns):
 #   Column                                                                          Non-Null Count  Dtype  
---  ------                                                                          --------------  -----  
 0   Rank                                                                            375 non-null    float64
 1   Name                                                                            1534 non-null   object 
 2   Tuition                                                                         375 non-null    float64
 3   Enrollment Numbers                                                              375 non-null    float64
 4   Applicants total                                                                1377 non-null   float64
 5   Admissions total                                                                1377 non-null   float64
 6   Enrolled total                                                                  1377 non-null   float64
 7   Percent of freshmen submitting SAT scores                                       1257 non-null   float64
 8   Percent of freshmen submitting ACT scores                                       1259 non-null   float64
 9   SAT Critical Reading 25th percentile score                                      1169 non-null   float64
 10  SAT Critical Reading 75th percentile score                                      1169 non-null   float64
 11  SAT Math 25th percentile score                                                  1182 non-null   float64
 12  SAT Math 75th percentile score                                                  1182 non-null   float64
 13  SAT Writing 25th percentile score                                               714 non-null    float64
 14  SAT Writing 75th percentile score                                               714 non-null    float64
 15  ACT Composite 25th percentile score                                             1199 non-null   float64
 16  ACT Composite 75th percentile score                                             1199 non-null   float64
 17  State abbreviation                                                              1534 non-null   object 
 18  Geographic region                                                               1534 non-null   object 
 19  Control of institution                                                          1534 non-null   object 
 20  Historically Black College or University                                        1534 non-null   object 
 21  Degree of urbanization (Urban centric locale)                                   1534 non-null   object 
 22  Carnegie Classification 2010: Basic                                             1534 non-null   object 
 23  Total enrollment                                                                1532 non-null   float64
 24  Full time enrollment                                                            1532 non-null   float64
 25  Part time enrollment                                                            1532 non-null   float64
 26  Undergraduate enrollment                                                        1532 non-null   float64
 27  Graduate enrollment                                                             1532 non-null   float64
 28  Full time undergraduate enrollment                                              1532 non-null   float64
 29  Part time undergraduate enrollment                                              1532 non-null   float64
 30  Percent of total enrollment that are Asian                                      1532 non-null   float64
 31  Percent of total enrollment that are Black or African American                  1532 non-null   float64
 32  Percent of total enrollment that are Hispanic/Latino                            1532 non-null   float64
 33  Percent of total enrollment that are Native Hawaiian or Other Pacific Islander  1532 non-null   float64
 34  Percent of total enrollment that are White                                      1532 non-null   float64
 35  Percent of total enrollment that are two or more races                          1532 non-null   float64
 36  Percent of total enrollment that are Nonresident Alien                          1532 non-null   float64
 37  Percent of total enrollment that are women                                      1532 non-null   float64
 38  Percent of undergraduate enrollment that are American Indian or Alaska Native   1522 non-null   float64
 39  Number of first time undergraduates  in state                                   911 non-null    float64
 40  Number of first time undergraduates  out of state                               911 non-null    float64
 41  Number of first time undergraduates  foreign countries                          911 non-null    float64
 42  Number of first time undergraduates  residence unknown                          911 non-null    float64
 43  Graduation rate  Bachelor degree within 4 years, total                          1476 non-null   float64
 44  Graduation rate  Bachelor degree within 5 years, total                          1476 non-null   float64
 45  Graduation rate  Bachelor degree within 6 years, total                          1476 non-null   float64
 46  Percent of freshmen receiving any financial aid                                 1492 non-null   float64
 47  Percent of freshmen receiving federal grant aid                                 1492 non-null   float64
 48  Percent of freshmen receiving Pell grants                                       1492 non-null   float64
 49  Percent of freshmen receiving institutional grant aid                           1492 non-null   float64
 50  Percent of freshmen receiving student loan aid                                  1492 non-null   float64
 51  Endowment assets                                                                1534 non-null   float64
dtypes: float64(45), object(7)
memory usage: 635.2+ KB
In [ ]:
data.dropna(subset=['Rank'], inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 375 entries, 0 to 393
Data columns (total 52 columns):
 #   Column                                                                          Non-Null Count  Dtype  
---  ------                                                                          --------------  -----  
 0   Rank                                                                            375 non-null    float64
 1   Name                                                                            375 non-null    object 
 2   Tuition                                                                         375 non-null    float64
 3   Enrollment Numbers                                                              375 non-null    float64
 4   Applicants total                                                                367 non-null    float64
 5   Admissions total                                                                367 non-null    float64
 6   Enrolled total                                                                  367 non-null    float64
 7   Percent of freshmen submitting SAT scores                                       354 non-null    float64
 8   Percent of freshmen submitting ACT scores                                       355 non-null    float64
 9   SAT Critical Reading 25th percentile score                                      337 non-null    float64
 10  SAT Critical Reading 75th percentile score                                      337 non-null    float64
 11  SAT Math 25th percentile score                                                  342 non-null    float64
 12  SAT Math 75th percentile score                                                  342 non-null    float64
 13  SAT Writing 25th percentile score                                               207 non-null    float64
 14  SAT Writing 75th percentile score                                               207 non-null    float64
 15  ACT Composite 25th percentile score                                             341 non-null    float64
 16  ACT Composite 75th percentile score                                             341 non-null    float64
 17  State abbreviation                                                              375 non-null    object 
 18  Geographic region                                                               375 non-null    object 
 19  Control of institution                                                          375 non-null    object 
 20  Historically Black College or University                                        375 non-null    object 
 21  Degree of urbanization (Urban centric locale)                                   375 non-null    object 
 22  Carnegie Classification 2010: Basic                                             375 non-null    object 
 23  Total enrollment                                                                375 non-null    float64
 24  Full time enrollment                                                            375 non-null    float64
 25  Part time enrollment                                                            375 non-null    float64
 26  Undergraduate enrollment                                                        375 non-null    float64
 27  Graduate enrollment                                                             375 non-null    float64
 28  Full time undergraduate enrollment                                              375 non-null    float64
 29  Part time undergraduate enrollment                                              375 non-null    float64
 30  Percent of total enrollment that are Asian                                      375 non-null    float64
 31  Percent of total enrollment that are Black or African American                  375 non-null    float64
 32  Percent of total enrollment that are Hispanic/Latino                            375 non-null    float64
 33  Percent of total enrollment that are Native Hawaiian or Other Pacific Islander  375 non-null    float64
 34  Percent of total enrollment that are White                                      375 non-null    float64
 35  Percent of total enrollment that are two or more races                          375 non-null    float64
 36  Percent of total enrollment that are Nonresident Alien                          375 non-null    float64
 37  Percent of total enrollment that are women                                      375 non-null    float64
 38  Percent of undergraduate enrollment that are American Indian or Alaska Native   375 non-null    float64
 39  Number of first time undergraduates  in state                                   272 non-null    float64
 40  Number of first time undergraduates  out of state                               272 non-null    float64
 41  Number of first time undergraduates  foreign countries                          272 non-null    float64
 42  Number of first time undergraduates  residence unknown                          272 non-null    float64
 43  Graduation rate  Bachelor degree within 4 years, total                          372 non-null    float64
 44  Graduation rate  Bachelor degree within 5 years, total                          372 non-null    float64
 45  Graduation rate  Bachelor degree within 6 years, total                          372 non-null    float64
 46  Percent of freshmen receiving any financial aid                                 373 non-null    float64
 47  Percent of freshmen receiving federal grant aid                                 373 non-null    float64
 48  Percent of freshmen receiving Pell grants                                       373 non-null    float64
 49  Percent of freshmen receiving institutional grant aid                           373 non-null    float64
 50  Percent of freshmen receiving student loan aid                                  373 non-null    float64
 51  Endowment assets                                                                375 non-null    float64
dtypes: float64(45), object(7)
memory usage: 155.3+ KB
In [ ]:
#columns with at most 30 nulls
data.columns[data.isnull().sum()<=30]
Out[ ]:
Index(['Rank', 'Name', 'Tuition', 'Enrollment Numbers', 'Applicants total',
       'Admissions total', 'Enrolled total',
       'Percent of freshmen submitting SAT scores',
       'Percent of freshmen submitting ACT scores', 'State abbreviation',
       'Geographic region', 'Control of institution',
       'Historically Black College or University',
       'Degree of urbanization (Urban centric locale)',
       'Carnegie Classification 2010: Basic', 'Total enrollment',
       'Full time enrollment', 'Part time enrollment',
       'Undergraduate enrollment', 'Graduate enrollment',
       'Full time undergraduate enrollment',
       'Part time undergraduate enrollment',
       'Percent of total enrollment that are Asian',
       'Percent of total enrollment that are Black or African American',
       'Percent of total enrollment that are Hispanic/Latino',
       'Percent of total enrollment that are Native Hawaiian or Other Pacific Islander',
       'Percent of total enrollment that are White',
       'Percent of total enrollment that are two or more races',
       'Percent of total enrollment that are Nonresident Alien',
       'Percent of total enrollment that are women',
       'Percent of undergraduate enrollment that are American Indian or Alaska Native',
       'Graduation rate  Bachelor degree within 4 years, total',
       'Graduation rate  Bachelor degree within 5 years, total',
       'Graduation rate  Bachelor degree within 6 years, total',
       'Percent of freshmen receiving any financial aid',
       'Percent of freshmen receiving federal grant aid',
       'Percent of freshmen receiving Pell grants',
       'Percent of freshmen receiving institutional grant aid',
       'Percent of freshmen receiving student loan aid', 'Endowment assets'],
      dtype='object')
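The cell above selects columns by comparing each column's null count against a threshold of 30. A small sketch of the same idea on an invented toy frame, where `max_nulls` is a hypothetical name for the threshold, shows how the result differs from a strict "no nulls" filter:

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; column names are hypothetical.
toy = pd.DataFrame({
    'full': [1.0, 2.0, 3.0, 4.0],
    'one_missing': [1.0, np.nan, 3.0, 4.0],
    'mostly_missing': [np.nan, np.nan, np.nan, 4.0],
})

null_counts = toy.isnull().sum()

# Hypothetical threshold for the toy data; the cell above uses 30.
max_nulls = 1
usable = toy.columns[null_counts <= max_nulls].tolist()

# Strict variant: only columns with no missing values at all.
no_nulls = toy.columns[null_counts == 0].tolist()
```

A tolerant threshold keeps columns that are nearly complete, at the cost of losing a few rows when `dropna()` is applied later.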
In [ ]:
df = data[[ 'Name','Rank','Tuition', 'Enrollment Numbers', 'Applicants total',
       'Admissions total',
       'Geographic region', 'Control of institution',
       'Historically Black College or University',
       'Degree of urbanization (Urban centric locale)',
       'Carnegie Classification 2010: Basic',
       'Undergraduate enrollment', 'Graduate enrollment',
       'Full time undergraduate enrollment',
       'Part time undergraduate enrollment',
       'Percent of total enrollment that are Asian',
       'Percent of total enrollment that are Black or African American',
       'Percent of total enrollment that are Hispanic/Latino',
       'Percent of total enrollment that are Native Hawaiian or Other Pacific Islander',
       'Percent of total enrollment that are White',
       'Percent of total enrollment that are women',
       'Percent of undergraduate enrollment that are American Indian or Alaska Native',
       'Graduation rate  Bachelor degree within 4 years, total',
       'Percent of freshmen receiving any financial aid',
       'Endowment assets']]
In [ ]:
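#drop rows with any remaining nulls; copy so later column assignments act on an independent frame
df = df.dropna().copy()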
In [ ]:
df.dtypes
Out[ ]:
Name                                                                               object
Rank                                                                              float64
Tuition                                                                           float64
Enrollment Numbers                                                                float64
Applicants total                                                                  float64
Admissions total                                                                  float64
Geographic region                                                                  object
Control of institution                                                             object
Historically Black College or University                                           object
Degree of urbanization (Urban centric locale)                                      object
Carnegie Classification 2010: Basic                                                object
Undergraduate enrollment                                                          float64
Graduate enrollment                                                               float64
Full time undergraduate enrollment                                                float64
Part time undergraduate enrollment                                                float64
Percent of total enrollment that are Asian                                        float64
Percent of total enrollment that are Black or African American                    float64
Percent of total enrollment that are Hispanic/Latino                              float64
Percent of total enrollment that are Native Hawaiian or Other Pacific Islander    float64
Percent of total enrollment that are White                                        float64
Percent of total enrollment that are women                                        float64
Percent of undergraduate enrollment that are American Indian or Alaska Native     float64
Graduation rate  Bachelor degree within 4 years, total                            float64
Percent of freshmen receiving any financial aid                                   float64
Endowment assets                                                                  float64
dtype: object
In [ ]:
df
Out[ ]:
Name Rank Tuition Enrollment Numbers Applicants total Admissions total Geographic region Control of institution Historically Black College or University Degree of urbanization (Urban centric locale) Carnegie Classification 2010: Basic Undergraduate enrollment Graduate enrollment Full time undergraduate enrollment Part time undergraduate enrollment Percent of total enrollment that are Asian Percent of total enrollment that are Black or African American Percent of total enrollment that are Hispanic/Latino Percent of total enrollment that are Native Hawaiian or Other Pacific Islander Percent of total enrollment that are White Percent of total enrollment that are women Percent of undergraduate enrollment that are American Indian or Alaska Native Graduation rate Bachelor degree within 4 years, total Percent of freshmen receiving any financial aid Endowment assets
0 Princeton University 0.000 56010.000 4773.000 26499.000 1963.000 Mid East DE DC MD NJ NY PA Private not for profit No Suburb: Large Research Universities (very high research acti... 5323.000 2691.000 5244.000 79.000 15.000 6.000 7.000 0.000 45.000 45.000 0.000 88.000 60.000 2320421.000
1 Columbia University 1.000 63530.000 6170.000 31851.000 2362.000 Mid East DE DC MD NJ NY PA Private not for profit No City: Large Research Universities (very high research acti... 7970.000 18987.000 7374.000 596.000 13.000 5.000 8.000 0.000 36.000 51.000 1.000 86.000 57.000 316753.000
2 Harvard University 2.000 55587.000 5222.000 35023.000 2047.000 New England CT ME MA NH RI VT Private not for profit No City: Midsize Research Universities (very high research acti... 10534.000 17763.000 7240.000 3294.000 13.000 5.000 7.000 0.000 45.000 49.000 0.000 87.000 75.000 1392761.000
3 Massachusetts Institute of Technology 3.000 55878.000 4361.000 18989.000 1548.000 New England CT ME MA NH RI VT Private not for profit No City: Midsize Research Universities (very high research acti... 4528.000 6773.000 4499.000 29.000 16.000 3.000 9.000 0.000 34.000 37.000 0.000 84.000 87.000 980404.000
4 Yale University 4.000 59950.000 4703.000 28977.000 2043.000 New England CT ME MA NH RI VT Private not for profit No City: Midsize Research Universities (very high research acti... 5430.000 6679.000 5424.000 6.000 13.000 5.000 7.000 0.000 48.000 49.000 1.000 90.000 61.000 1528324.000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
386 Western Kentucky University 384.000 26496.000 15286.000 8526.000 7871.000 Southeast AL AR FL GA KY LA MS NC SC TN VA WV Public No City: Small Master's Colleges and Universities (larger pro... 17509.000 2939.000 13382.000 4127.000 1.000 10.000 2.000 0.000 77.000 58.000 0.000 25.000 93.000 945.000
387 Wichita State University 385.000 18166.000 12406.000 3492.000 3344.000 Plains IA KS MN MO NE ND SD Public No City: Large Research Universities (high research activity) 11670.000 2716.000 8807.000 2863.000 6.000 6.000 8.000 0.000 63.000 52.000 1.000 22.000 89.000 17845.000
388 William Carey University 386.000 14100.000 3264.000 771.000 376.000 Southeast AL AR FL GA KY LA MS NC SC TN VA WV Private not for profit No City: Small Master's Colleges and Universities (larger pro... 2257.000 1625.000 1886.000 371.000 3.000 27.000 2.000 0.000 64.000 66.000 1.000 46.000 92.000 2958.000
389 William Woods University 387.000 25930.000 873.000 897.000 674.000 Plains IA KS MN MO NE ND SD Private not for profit No Town: Distant Master's Colleges and Universities (larger pro... 1002.000 1134.000 843.000 159.000 1.000 3.000 1.000 0.000 84.000 68.000 0.000 45.000 100.000 11097.000
391 Wingate University 389.000 40170.000 2683.000 5323.000 4221.000 Southeast AL AR FL GA KY LA MS NC SC TN VA WV Private not for profit No Suburb: Large Master's Colleges and Universities (smaller pr... 2009.000 993.000 1953.000 56.000 2.000 14.000 2.000 0.000 62.000 60.000 1.000 47.000 99.000 17933.000

365 rows × 25 columns

In [ ]:
#note: this overwrites 'Enrollment Numbers' from the rank file with undergraduate + graduate enrollment
df['Enrollment Numbers'] = df['Undergraduate enrollment'] + df['Graduate enrollment']
#derived shares, each expressed as a fraction between 0 and 1
df['Percent undergraduate'] = df['Undergraduate enrollment'] / df['Enrollment Numbers']
df['Percent fulltime'] = df['Full time undergraduate enrollment'] / df['Undergraduate enrollment']
df['Percent admitted'] = df['Admissions total'] / df['Applicants total']
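The derived columns above are plain ratios of existing counts. A minimal sketch on invented numbers, using `assign` so the input frame is left unchanged (the short names `total`, `pct_undergrad`, `pct_fulltime` and `pct_admitted` are hypothetical), makes the arithmetic easy to check by hand:

```python
import pandas as pd

# Toy rows with invented counts; column names mirror the notebook's.
toy = pd.DataFrame({
    'Undergraduate enrollment': [800.0, 1500.0],
    'Graduate enrollment': [200.0, 500.0],
    'Full time undergraduate enrollment': [600.0, 1500.0],
    'Admissions total': [50.0, 300.0],
    'Applicants total': [500.0, 400.0],
})

# Later keyword arguments may refer to columns created by earlier ones.
ratios = toy.assign(
    total=lambda d: d['Undergraduate enrollment'] + d['Graduate enrollment'],
    pct_undergrad=lambda d: d['Undergraduate enrollment'] / d['total'],
    pct_fulltime=lambda d: d['Full time undergraduate enrollment']
    / d['Undergraduate enrollment'],
    pct_admitted=lambda d: d['Admissions total'] / d['Applicants total'],
)

# Sanity check: every derived share should lie in [0, 1].
share_cols = ['pct_undergrad', 'pct_fulltime', 'pct_admitted']
all_in_unit_interval = bool(ratios[share_cols].apply(lambda s: s.between(0, 1)).all().all())
```

Despite the "Percent" names, these are fractions rather than values on a 0 to 100 scale, which matters when they are mixed with the existing percentage columns.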
In [ ]:
df['Geographic region'].unique()
Out[ ]:
array(['Mid East DE DC MD NJ NY PA', 'New England CT ME MA NH RI VT',
       'Far West AK CA HI NV OR WA', 'Great Lakes IL IN MI OH WI',
       'Southeast AL AR FL GA KY LA MS NC SC TN VA WV',
       'Plains IA KS MN MO NE ND SD', 'Southwest AZ NM OK TX',
       'Rocky Mountains CO ID MT UT WY'], dtype=object)
In [ ]:
df['Control of institution'].unique()
Out[ ]:
array(['Private not for profit', 'Public'], dtype=object)
In [ ]:
df['Historically Black College or University'].unique()
Out[ ]:
array(['No', 'Yes'], dtype=object)
In [ ]:
df['Degree of urbanization (Urban centric locale)'].unique()
Out[ ]:
array(['Suburb: Large', 'City: Large', 'City: Midsize', 'City: Small',
       'Town: Remote', 'Suburb: Midsize', 'Suburb: Small', 'Town: Fringe',
       'Rural: Fringe', 'Town: Distant'], dtype=object)
In [ ]:
df['Carnegie Classification 2010: Basic'].unique()
Out[ ]:
array(['Research Universities (very high research activity)',
       'Research Universities (high research activity)',
       'Doctoral/Research Universities',
       "Master's Colleges and Universities (larger programs)",
       "Master's Colleges and Universities (smaller programs)",
       "Master's Colleges and Universities (medium programs)",
       'Baccalaureate Colleges Diverse Fields',
       'Baccalaureate Colleges Arts & Sciences'], dtype=object)
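The categorical columns inspected above hold text labels, which distance-based methods such as KMeans cannot use directly. One common option, shown here only as a sketch on an invented toy frame and not necessarily the approach this notebook takes, is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Toy frame with invented rows; category labels match those listed above.
toy = pd.DataFrame({
    'Control of institution': ['Private not for profit', 'Public', 'Public'],
    'Historically Black College or University': ['No', 'No', 'Yes'],
    'Tuition': [50000.0, 25000.0, 18000.0],
})

cat_cols = [
    'Control of institution',
    'Historically Black College or University',
]

# drop_first=True keeps one indicator per two-level column,
# avoiding a redundant pair of perfectly correlated columns.
encoded = pd.get_dummies(toy, columns=cat_cols, drop_first=True)
encoded_columns = encoded.columns.tolist()
is_public = encoded['Control of institution_Public'].astype(int).tolist()
```

Columns with many levels, such as the region or Carnegie classification, would each expand into several indicator columns under this scheme.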
In [ ]:
df.columns
Out[ ]:
Index(['Name', 'Rank', 'Tuition', 'Enrollment Numbers', 'Applicants total',
       'Admissions total', 'Geographic region', 'Control of institution',
       'Historically Black College or University',
       'Degree of urbanization (Urban centric locale)',
       'Carnegie Classification 2010: Basic', 'Undergraduate enrollment',
       'Graduate enrollment', 'Full time undergraduate enrollment',
       'Part time undergraduate enrollment',
       'Percent of total enrollment that are Asian',
       'Percent of total enrollment that are Black or African American',
       'Percent of total enrollment that are Hispanic/Latino',
       'Percent of total enrollment that are Native Hawaiian or Other Pacific Islander',
       'Percent of total enrollment that are White',
       'Percent of total enrollment that are women',
       'Percent of undergraduate enrollment that are American Indian or Alaska Native',
       'Graduation rate  Bachelor degree within 4 years, total',
       'Percent of freshmen receiving any financial aid', 'Endowment assets',
       'Percent undergraduate', 'Percent fulltime', 'Percent admitted'],
      dtype='object')
In [ ]:
df.describe()
Out[ ]:
Rank Tuition Enrollment Numbers Applicants total Admissions total Undergraduate enrollment Graduate enrollment Full time undergraduate enrollment Part time undergraduate enrollment Percent of total enrollment that are Asian Percent of total enrollment that are Black or African American Percent of total enrollment that are Hispanic/Latino Percent of total enrollment that are Native Hawaiian or Other Pacific Islander Percent of total enrollment that are White Percent of total enrollment that are women Percent of undergraduate enrollment that are American Indian or Alaska Native Graduation rate Bachelor degree within 4 years, total Percent of freshmen receiving any financial aid Endowment assets Percent undergraduate Percent fulltime Percent admitted
count 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000
mean 187.227 34593.723 16898.238 13867.986 7260.882 12326.797 4571.441 10476.556 1850.241 6.192 10.605 8.907 0.074 58.414 55.340 0.321 42.118 86.299 62104.348 0.708 0.852 0.622
std 110.507 13681.923 12481.095 12524.285 5869.563 9643.301 3992.240 8268.793 2311.730 6.417 13.882 9.987 0.609 18.201 9.366 0.744 21.973 12.326 199648.073 0.135 0.112 0.208
min 0.000 -1.000 1225.000 84.000 83.000 973.000 214.000 611.000 0.000 0.000 0.000 1.000 0.000 0.000 23.000 0.000 0.000 44.000 0.000 0.201 0.373 0.057
25% 93.000 24110.000 6747.000 4801.000 2764.000 4428.000 1887.000 3815.000 359.000 2.000 4.000 3.000 0.000 48.000 51.000 0.000 24.000 79.000 6274.000 0.632 0.785 0.507
50% 185.000 32299.000 13868.000 10525.000 5529.000 9718.000 3335.000 8040.000 1091.000 4.000 6.000 6.000 0.000 62.000 55.000 0.000 38.000 90.000 14574.000 0.745 0.880 0.655
75% 277.000 44382.000 24629.000 18989.000 10405.000 18615.000 5789.000 15879.000 2664.000 8.000 12.000 10.000 0.000 72.000 60.000 0.000 59.000 96.000 36852.000 0.813 0.937 0.774
max 389.000 63530.000 77338.000 72676.000 35815.000 51333.000 29874.000 40020.000 21553.000 39.000 91.000 79.000 10.000 94.000 95.000 6.000 90.000 100.000 2320421.000 0.918 1.000 1.000
In [ ]:
df_small = df[['Name', 'Rank', 'Tuition', 'Enrollment Numbers', 'Geographic region', 'Control of institution',
       'Historically Black College or University',
       'Degree of urbanization (Urban centric locale)',
       'Carnegie Classification 2010: Basic',
       'Percent of total enrollment that are White',
       'Percent of total enrollment that are women',
       'Graduation rate  Bachelor degree within 4 years, total',
       'Percent of freshmen receiving any financial aid', 'Endowment assets',
       'Percent undergraduate', 'Percent fulltime', 'Percent admitted']]

Clustering

In [ ]:
sns.pairplot(df_small)
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x1b6d90a7650>
In [ ]:
sns.pairplot(df_small, hue = 'Control of institution')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x1b6ad740b50>

The private/public split is clearly visible in many of the panels, particularly those involving tuition, enrollment size, and percent undergraduate.

In [ ]:
sns.pairplot(df_small, hue = 'Degree of urbanization (Urban centric locale)')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x1b6b639fb50>

Strong blend -- no super distinct splits.

In [ ]:
sns.pairplot(df_small, hue = 'Carnegie Classification 2010: Basic')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x1b706e85290>

Very high research universities tend to appear near each other on many graphs, as do high research universities and large master's programs. However, there is still a lot of blending and non-distinct clustering.

In [ ]:
sns.pairplot(df_small, hue = 'Geographic region')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x1b6999b5d50>

Very blended, little to no distinct splits.

Correlations -- many things correlate with "total," which makes sense since that is just the sum of all the stats

In [ ]:
df_small = df_small.copy()  # work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
cat_cols = ['Geographic region', 'Control of institution',
            'Historically Black College or University',
            'Degree of urbanization (Urban centric locale)',
            'Carnegie Classification 2010: Basic']
#encode each categorical column as integer category codes in a new '<column> num' column
for col in cat_cols:
    df_small[col + ' num'] = df_small[col].astype('category').cat.codes
In [ ]:
df_small
Out[ ]:
Name Rank Tuition Enrollment Numbers Geographic region Control of institution Historically Black College or University Degree of urbanization (Urban centric locale) Carnegie Classification 2010: Basic Percent of total enrollment that are White Percent of total enrollment that are women Graduation rate Bachelor degree within 4 years, total Percent of freshmen receiving any financial aid Endowment assets Percent undergraduate Percent fulltime Percent admitted Geographic region num Control of institution num Historically Black College or University num Degree of urbanization (Urban centric locale) num Carnegie Classification 2010: Basic num
0 Princeton University 0.000 56010.000 8014.000 Mid East DE DC MD NJ NY PA Private not for profit No Suburb: Large Research Universities (very high research acti... 45.000 45.000 88.000 60.000 2320421.000 0.664 0.985 0.074 2 0 0 4 7
1 Columbia University 1.000 63530.000 26957.000 Mid East DE DC MD NJ NY PA Private not for profit No City: Large Research Universities (very high research acti... 36.000 51.000 86.000 57.000 316753.000 0.296 0.925 0.074 2 0 0 0 7
2 Harvard University 2.000 55587.000 28297.000 New England CT ME MA NH RI VT Private not for profit No City: Midsize Research Universities (very high research acti... 45.000 49.000 87.000 75.000 1392761.000 0.372 0.687 0.058 3 0 0 1 7
3 Massachusetts Institute of Technology 3.000 55878.000 11301.000 New England CT ME MA NH RI VT Private not for profit No City: Midsize Research Universities (very high research acti... 34.000 37.000 84.000 87.000 980404.000 0.401 0.994 0.082 3 0 0 1 7
4 Yale University 4.000 59950.000 12109.000 New England CT ME MA NH RI VT Private not for profit No City: Midsize Research Universities (very high research acti... 48.000 49.000 90.000 61.000 1528324.000 0.448 0.999 0.071 3 0 0 1 7
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
386 Western Kentucky University 384.000 26496.000 20448.000 Southeast AL AR FL GA KY LA MS NC SC TN VA WV Public No City: Small Master's Colleges and Universities (larger pro... 77.000 58.000 25.000 93.000 945.000 0.856 0.764 0.923 6 1 0 2 3
387 Wichita State University 385.000 18166.000 14386.000 Plains IA KS MN MO NE ND SD Public No City: Large Research Universities (high research activity) 63.000 52.000 22.000 89.000 17845.000 0.811 0.755 0.958 4 1 0 0 6
388 William Carey University 386.000 14100.000 3882.000 Southeast AL AR FL GA KY LA MS NC SC TN VA WV Private not for profit No City: Small Master's Colleges and Universities (larger pro... 64.000 66.000 46.000 92.000 2958.000 0.581 0.836 0.488 6 0 0 2 3
389 William Woods University 387.000 25930.000 2136.000 Plains IA KS MN MO NE ND SD Private not for profit No Town: Distant Master's Colleges and Universities (larger pro... 84.000 68.000 45.000 100.000 11097.000 0.469 0.841 0.751 4 0 0 7 3
391 Wingate University 389.000 40170.000 3002.000 Southeast AL AR FL GA KY LA MS NC SC TN VA WV Private not for profit No Suburb: Large Master's Colleges and Universities (smaller pr... 62.000 60.000 47.000 99.000 17933.000 0.669 0.972 0.793 6 0 0 4 5

365 rows × 22 columns

In [ ]:
plt.figure(figsize=(15,8))
sns.heatmap(df_small.corr(numeric_only=True),annot = True)
## or you can drop the non-numeric columns instead of setting numeric_only to True
## non_numeric = df_small.select_dtypes(exclude='number').columns
## sns.heatmap(df_small.drop(columns=non_numeric).corr(), annot = True)
Out[ ]:
<Axes: >

There is a lot of correlation between variables, including rank, tuition, and graduating within four years. Percent admitted and endowment assets are also strongly correlated with other variables, as are several of the encoded categorical variables. PCA will thus be valuable for clustering.

Explore Visualization using PCA¶

PCA (Principal Component Analysis) is a dimension reduction technique that consolidates the key information in a dataset's features into a new, smaller set of features (principal components). The components are uncorrelated with each other, and each one explains a share of the variance in the dataset.
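As a rough illustration of those two properties (uncorrelated components, each explaining a share of the variance), here is a minimal sketch on synthetic data rather than the college data. The names `demo`, `demo_pca`, and `components` are made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features, two of which are strongly correlated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

# Standardize, then project onto two principal components.
demo_scaled = StandardScaler().fit_transform(demo)
demo_pca = PCA(n_components=2)
components = demo_pca.fit_transform(demo_scaled)

# Share of the total variance captured by each component.
print(demo_pca.explained_variance_ratio_)

# Correlation matrix of the new features: the off-diagonal entries are ~0.
print(np.corrcoef(components, rowvar=False).round(3))
```

On the notebook's data, the same `explained_variance_ratio_` attribute would show how much of the spread the two plotted components actually retain.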

Make a dictionary for converting the type columns (and also knowing which values relate to what type when we come to look at it again later).

In [ ]:
#turn all columns into X
X = df_small.drop(['Name', 'Rank', 'Geographic region', 'Control of institution',
       'Historically Black College or University',
       'Degree of urbanization (Urban centric locale)',
       'Carnegie Classification 2010: Basic'], axis=1)
In [ ]:
#Create PCA model
pca = PCA(n_components=2)
pca_mdl = pca.fit_transform(X)
pca_df = pd.DataFrame(pca_mdl)
In [ ]:
sns.scatterplot(x = pca_df[0], y = pca_df[1])
Out[ ]:
<Axes: xlabel='0', ylabel='1'>

... okay weird there are definitely some outliers and it looks wonky. How to handle these? We will test outlier removal and data scaling.
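One likely reason the unscaled plot looks odd is that PCA is driven by raw variance, so a feature measured in large units can dominate the components. The sketch below shows this with two made-up, independent features on very different scales. It is an illustration under those assumptions, not a diagnosis of this dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two independent hypothetical features on very different scales
# (think dollar amounts versus a 0-1 proportion).
big = rng.normal(loc=60_000, scale=200_000, size=300)
small = rng.uniform(0, 1, size=300)
toy = np.column_stack([big, small])

raw_pca = PCA(n_components=2).fit(toy)
scaled_pca = PCA(n_components=2).fit(StandardScaler().fit_transform(toy))

# Unscaled: the first component is almost entirely the large-scale feature.
print(raw_pca.explained_variance_ratio_)
# Scaled: both features contribute on a comparable footing.
print(scaled_pca.explained_variance_ratio_)
```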

In [ ]:
X.columns
Out[ ]:
Index(['Tuition', 'Enrollment Numbers',
       'Percent of total enrollment that are White',
       'Percent of total enrollment that are women',
       'Graduation rate  Bachelor degree within 4 years, total',
       'Percent of freshmen receiving any financial aid', 'Endowment assets',
       'Percent undergraduate', 'Percent fulltime', 'Percent admitted',
       'Geographic region num', 'Control of institution num',
       'Historically Black College or University num',
       'Degree of urbanization (Urban centric locale) num',
       'Carnegie Classification 2010: Basic num'],
      dtype='object')
In [ ]:
from sklearn.preprocessing import StandardScaler
#scale the float64/int64 columns in place; the int8 category-code columns are not selected, so they stay unscaled
numerical_cols = df_small.select_dtypes(include=['float64', 'int64']).columns
scaler = StandardScaler()
df_small[numerical_cols] = scaler.fit_transform(df_small[numerical_cols])
df_small.describe()
Out[ ]:
Rank Tuition Enrollment Numbers Percent of total enrollment that are White Percent of total enrollment that are women Graduation rate Bachelor degree within 4 years, total Percent of freshmen receiving any financial aid Endowment assets Percent undergraduate Percent fulltime Percent admitted Geographic region num Control of institution num Historically Black College or University num Degree of urbanization (Urban centric locale) num Carnegie Classification 2010: Basic num
count 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000
mean 0.000 0.000 0.000 -0.000 0.000 0.000 0.000 0.000 0.000 -0.000 0.000 3.603 0.540 0.027 2.282 4.764
std 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 2.365 0.499 0.163 2.510 2.007
min -1.697 -2.532 -1.257 -3.214 -3.458 -1.919 -3.436 -0.311 -3.773 -4.263 -2.718 0.000 0.000 0.000 0.000 0.000
25% -0.854 -0.767 -0.814 -0.573 -0.464 -0.826 -0.593 -0.280 -0.565 -0.595 -0.553 2.000 0.000 0.000 0.000 3.000
50% -0.020 -0.168 -0.243 0.197 -0.036 -0.188 0.301 -0.238 0.276 0.255 0.158 3.000 1.000 0.000 1.000 6.000
75% 0.813 0.716 0.620 0.748 0.498 0.769 0.788 -0.127 0.784 0.758 0.733 6.000 1.000 0.000 4.000 7.000
max 1.828 2.118 4.849 1.958 4.241 2.182 1.113 11.327 1.565 1.320 1.819 7.000 1.000 1.000 9.000 7.000
In [ ]:
#scale the data without changing the column names
scaler = preprocessing.StandardScaler()
scaled_X = scaler.fit_transform(X)
scaled_X = pd.DataFrame(scaled_X, columns=X.columns)
scaled_X.describe()
Out[ ]:
Tuition Enrollment Numbers Percent of total enrollment that are White Percent of total enrollment that are women Graduation rate Bachelor degree within 4 years, total Percent of freshmen receiving any financial aid Endowment assets Percent undergraduate Percent fulltime Percent admitted Geographic region num Control of institution num Historically Black College or University num Degree of urbanization (Urban centric locale) num Carnegie Classification 2010: Basic num
count 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000 365.000
mean 0.000 0.000 0.000 -0.000 0.000 -0.000 0.000 -0.000 0.000 0.000 -0.000 0.000 0.000 -0.000 -0.000
std 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001 1.001
min -2.532 -1.257 -3.214 -3.458 -1.919 -3.436 -0.311 -3.773 -4.263 -2.718 -1.526 -1.083 -0.168 -0.910 -2.378
25% -0.767 -0.814 -0.573 -0.464 -0.826 -0.593 -0.280 -0.565 -0.595 -0.553 -0.679 -1.083 -0.168 -0.910 -0.880
50% -0.168 -0.243 0.197 -0.036 -0.188 0.301 -0.238 0.276 0.255 0.158 -0.255 0.923 -0.168 -0.511 0.617
75% 0.716 0.620 0.748 0.498 0.769 0.788 -0.127 0.784 0.758 0.733 1.015 0.923 -0.168 0.685 1.116
max 2.118 4.849 1.958 4.241 2.182 1.113 11.327 1.565 1.320 1.819 1.439 0.923 5.958 2.680 1.116
In [ ]:
pca = PCA(n_components=2)
pca_mdl = pca.fit_transform(df_small[['Percent of total enrollment that are White',
       'Percent of total enrollment that are women',
       'Graduation rate  Bachelor degree within 4 years, total',
       'Percent of freshmen receiving any financial aid', 'Endowment assets',
       'Percent undergraduate', 'Percent fulltime', 'Percent admitted',
       'Geographic region num', 'Control of institution num',
       'Historically Black College or University num',
       'Degree of urbanization (Urban centric locale) num',
       'Carnegie Classification 2010: Basic num']])
pca_df = pd.DataFrame(pca_mdl)
sns.scatterplot(x = pca_df[0], y = pca_df[1])
Out[ ]:
<Axes: xlabel='0', ylabel='1'>
In [ ]:
pca = PCA(n_components=2)
pca_mdl = pca.fit_transform(scaled_X)
pca_df = pd.DataFrame(pca_mdl)
sns.scatterplot(x = pca_df[0], y = pca_df[1])

Much much better! Let's explore these for clustering.

In [ ]:
# from sklearn.decomposition import PCA
# from scipy.spatial import distance
# import numpy as np

# # Create PCA model
# pca = PCA(n_components=2)
# pca_mdl = pca.fit_transform(scaled_X)

# # Calculate the distance from each point to the origin
# distances = np.sqrt((pca_mdl**2).sum(axis=1))

# # Calculate the mean and standard deviation of the distances
# mean_distance = np.mean(distances)
# std_distance = np.std(distances)

# # Define outliers to be any point that is more than 3 standard deviations from the mean
# outliers = pca_mdl[distances > mean_distance + 3*std_distance]
In [ ]:
#use outliers to find the index of the outliers
#(compare whole rows; `row in outliers` on NumPy arrays matches on any single shared value)
outliers_index = [i for i, row in enumerate(pca_mdl)
                  if any(np.array_equal(row, o) for o in outliers)]

#create a new df without the outliers
new_df = scaled_X.drop(outliers_index).copy()
In [ ]:
#Create PCA model
pca = PCA(n_components=2)
pca_mdl = pca.fit_transform(new_df)
pca_df = pd.DataFrame(pca_mdl)
sns.scatterplot(x = pca_df[0], y = pca_df[1])
Out[ ]:
<Axes: xlabel='0', ylabel='1'>
In [ ]:
#print outliers_index from scaled_X
scaled_X.iloc[outliers_index]
Out[ ]:
Tuition Enrollment Numbers Percent of total enrollment that are White Percent of total enrollment that are women Graduation rate Bachelor degree within 4 years, total Percent of freshmen receiving any financial aid Endowment assets Percent undergraduate Percent fulltime Percent admitted Geographic region num Control of institution num Historically Black College or University num Degree of urbanization (Urban centric locale) num Carnegie Classification 2010: Basic num

K-Means Clustering¶

We will begin our modeling with K-Means Clustering.

Briefly explain how the K-Means clustering model works.

K-means clustering is a centroid-based algorithm that partitions the data into k clusters. It assigns each data point to the cluster whose centroid (the mean of the points in that cluster) is nearest, recomputes the centroids, and repeats until the assignments stop changing.
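To make the assign-then-update loop concrete, here is a minimal NumPy sketch of the basic iteration (often called Lloyd's algorithm) on synthetic blobs. It is a teaching sketch, not scikit-learn's implementation; `kmeans_sketch` and the toy data are hypothetical names introduced here.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Plain Lloyd iteration: assign to nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Two well-separated synthetic blobs.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
toy_points = np.vstack([blob_a, blob_b])

toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_centroids.round(2))
```

scikit-learn's `KMeans` adds smarter initialization and multiple restarts on top of this basic loop, which is why the notebook uses it rather than a hand-rolled version.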

Remember how we determine the best number of clusters (if we can't just manually look at it and decide)?

We look at the variance, measured here as the sum of squared distances between the observations and their centroids. Note: "inertia" is the "within-cluster sum-of-squares criterion" (see the scikit-learn documentation). The loop below stores the square root of the inertia for each k.
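As a quick sanity check on that definition, the sketch below recomputes the within-cluster sum of squares by hand on random synthetic data and compares it with `inertia_`. The variable names are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 3))

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(toy)

# Sum of squared distances from each point to its assigned centroid.
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = ((toy - assigned_centroids) ** 2).sum()

print(manual_inertia, km.inertia_)
```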

In [ ]:
inertia = []
for k in range(1,8):
    kmeans = KMeans(n_clusters=k, random_state=1).fit(scaled_X)
    inertia.append(np.sqrt(kmeans.inertia_))
c:\Users\jesse\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
c:\Users\jesse\anaconda3\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(

Here, we look for the "elbow": the value of k up to which the variance decreases sharply and after which it decreases at a slower rate. That value is our preferred number of clusters.

In [ ]:
plt.plot(range(1, 8), inertia, marker='s');
plt.xlabel('$k$')
plt.ylabel('Variance')
Out[ ]:
Text(0, 0.5, 'Variance')

In this case, what is the optimal number of clusters and why?

As explained above, the variance decreases pretty consistently until 3, so we are going to start with 3 clusters and look at those.

In [ ]:
#create KMeans model
kmeans = KMeans(n_clusters=3, random_state=1).fit(scaled_X)

Now that we have fit our k-means clusters, let's find which cluster label (0, 1, or 2, since we have set K=3) each row of data has so we can visualize it.

In [ ]:
y = kmeans.fit_predict(scaled_X)

We are reusing the PCA (dimensionality reduction) data frame for the sake of visualizing 2-dimensional data (rather than the 15 features in X).

In [ ]:
sns.scatterplot(x = pca_df[0], y = pca_df[1], hue=y)
Out[ ]:
<Axes: xlabel='0', ylabel='1'>

We could also try plotting individual features to take a look.

In [ ]:
sns.scatterplot(x = df_small['Rank'], y = df_small['Tuition'], hue=y)
Out[ ]:
<Axes: xlabel='Rank', ylabel='Tuition'>

Let's add our clusters back to the original DataFrame so we can take a look at some of the items.

In [ ]:
df_small['Cluster'] = y
df_small
Out[ ]:
Name Rank Tuition Enrollment Numbers Geographic region Control of institution Historically Black College or University Degree of urbanization (Urban centric locale) Carnegie Classification 2010: Basic Percent of total enrollment that are White Percent of total enrollment that are women Graduation rate Bachelor degree within 4 years, total Percent of freshmen receiving any financial aid Endowment assets Percent undergraduate Percent fulltime Percent admitted Geographic region num Control of institution num Historically Black College or University num Degree of urbanization (Urban centric locale) num Carnegie Classification 2010: Basic num Cluster
0 Princeton University -1.697 1.567 -0.713 Mid East DE DC MD NJ NY PA Private not for profit No Suburb: Large Research Universities (very high research acti... -0.738 -1.106 2.091 -2.137 11.327 -0.326 1.188 -2.636 2 0 0 4 7 2
1 Columbia University -1.688 2.118 0.807 Mid East DE DC MD NJ NY PA Private not for profit No City: Large Research Universities (very high research acti... -1.233 -0.464 2.000 -2.380 1.277 -3.068 0.654 -2.635 2 0 0 0 7 2
2 Harvard University -1.678 1.536 0.915 New England CT ME MA NH RI VT Private not for profit No City: Midsize Research Universities (very high research acti... -0.738 -0.678 2.045 -0.918 6.674 -2.498 -1.464 -2.711 3 0 0 1 7 2
3 Massachusetts Institute of Technology -1.669 1.558 -0.449 New England CT ME MA NH RI VT Private not for profit No City: Midsize Research Universities (very high research acti... -1.343 -1.961 1.909 0.057 4.606 -2.287 1.263 -2.600 3 0 0 1 7 2
4 Yale University -1.660 1.856 -0.384 New England CT ME MA NH RI VT Private not for profit No City: Midsize Research Universities (very high research acti... -0.573 -0.678 2.182 -2.055 7.354 -1.931 1.310 -2.653 3 0 0 1 7 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
386 Western Kentucky University 1.783 -0.593 0.285 Southeast AL AR FL GA KY LA MS NC SC TN VA WV Public No City: Small Master's Colleges and Universities (larger pro... 1.023 0.284 -0.780 0.544 -0.307 1.103 -0.779 1.450 6 1 0 2 3 0
387 Wichita State University 1.792 -1.202 -0.202 Plains IA KS MN MO NE ND SD Public No City: Large Research Universities (high research activity) 0.252 -0.357 -0.917 0.219 -0.222 0.767 -0.864 1.615 4 1 0 0 6 0
388 William Carey University 1.801 -1.500 -1.044 Southeast AL AR FL GA KY LA MS NC SC TN VA WV Private not for profit No City: Small Master's Colleges and Universities (larger pro... 0.307 1.140 0.177 0.463 -0.297 -0.942 -0.143 -0.646 6 0 0 2 3 1
389 William Woods University 1.810 -0.634 -1.184 Plains IA KS MN MO NE ND SD Private not for profit No Town: Distant Master's Colleges and Universities (larger pro... 1.408 1.354 0.131 1.113 -0.256 -1.777 -0.093 0.623 4 0 0 7 3 1
391 Wingate University 1.828 0.408 -1.115 Southeast AL AR FL GA KY LA MS NC SC TN VA WV Private not for profit No Suburb: Large Master's Colleges and Universities (smaller pr... 0.197 0.498 0.222 1.032 -0.222 -0.289 1.072 0.823 6 0 0 4 5 1

365 rows × 23 columns

In [ ]:
sns.pairplot(df_small, hue = 'Cluster')
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x1b6918e9290>

Making an interactive scatterplot (so it is easier to hover over individual data points). Also note that the x- and y-axes are our PCA values (from dimensionality reduction).
Below, we concat the dataframe along with the PCA values so that we can visualize properly. hover_data allows us to specify which columns we want to look at when hovering over each point.
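The plotting cell itself is not shown here, so the following is only a sketch of how the concat-and-hover step could look, assuming Plotly Express is the plotting library. `build_hover_frame`, `hover_scatter`, and the column names `PC1`/`PC2` are hypothetical. The index is reset before concatenating because `df_small` has gaps in its index (for example 389 then 391), which would otherwise misalign the rows.

```python
import numpy as np
import pandas as pd

def build_hover_frame(info_df, pca_values, labels):
    """Combine descriptive columns, 2-D PCA coordinates, and cluster labels."""
    pca_part = pd.DataFrame(pca_values, columns=['PC1', 'PC2'])
    out = pd.concat([info_df.reset_index(drop=True), pca_part], axis=1)
    # Store labels as strings so they are treated as discrete categories.
    out['Cluster'] = pd.Series(labels).astype(str).values
    return out

def hover_scatter(plot_df, hover_cols):
    """Interactive scatterplot; assumes Plotly is installed."""
    import plotly.express as px  # imported lazily so the helper above works without it
    return px.scatter(plot_df, x='PC1', y='PC2', color='Cluster',
                      hover_name='Name', hover_data=hover_cols)

# Tiny made-up example; with the notebook's objects this might be
# build_hover_frame(df_small[['Name', 'Rank', 'Tuition']], pca_mdl, y).
info = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Rank': [1, 2, 3]}, index=[0, 2, 5])
plot_df = build_hover_frame(info, np.array([[0.1, 0.2], [1.0, -1.0], [2.0, 0.5]]), [0, 1, 1])
print(plot_df)
# fig = hover_scatter(plot_df, hover_cols=['Rank']); fig.show()
```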


Agglomerative Clustering¶

Let's try agglomerative clustering with the same dataset as what we did above to see how it differs. But first, can you give a brief description of Agglomerative Clustering?

Agglomerative clustering is hierarchical rather than centroid-based: it builds a tree of clusters from the bottom up. Each point begins as its own cluster, then the closest pair of clusters is merged. This repeats until a stopping criterion is met (for example, a target number of clusters).
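A minimal sketch of that bottom-up merging on synthetic blobs, using scikit-learn's `AgglomerativeClustering` with Ward linkage (its default). The toy data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Three well-separated synthetic blobs of 20 points each.
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=4.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=[0.0, 4.0], scale=0.3, size=(20, 2)),
])

# Merge clusters pairwise until only three remain.
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
agg_labels = agg.fit_predict(toy)

# Number of points assigned to each cluster.
print(np.bincount(agg_labels))
```

Unlike K-means there is no random initialization here, so repeated runs on the same data give the same labels.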

In [ ]:
AgglomerativeClustering?
Init signature:
AgglomerativeClustering(
    n_clusters=2,
    *,
    affinity='deprecated',
    metric=None,
    memory=None,
    connectivity=None,
    compute_full_tree='auto',
    linkage='ward',
    distance_threshold=None,
    compute_distances=False,
)
Docstring:     
Agglomerative Clustering.

Recursively merges pair of clusters of sample data; uses linkage distance.

Read more in the :ref:`User Guide <hierarchical_clustering>`.

Parameters
----------
n_clusters : int or None, default=2
    The number of clusters to find. It must be ``None`` if
    ``distance_threshold`` is not ``None``.

affinity : str or callable, default='euclidean'
    The metric to use when calculating distance between instances in a
    feature array. If metric is a string or callable, it must be one of
    the options allowed by :func:`sklearn.metrics.pairwise_distances` for
    its metric parameter.
    If linkage is "ward", only "euclidean" is accepted.
    If "precomputed", a distance matrix (instead of a similarity matrix)
    is needed as input for the fit method.

    .. deprecated:: 1.2
        `affinity` was deprecated in version 1.2 and will be renamed to
        `metric` in 1.4.

metric : str or callable, default=None
    Metric used to compute the linkage. Can be "euclidean", "l1", "l2",
    "manhattan", "cosine", or "precomputed". If set to `None` then
    "euclidean" is used. If linkage is "ward", only "euclidean" is
    accepted. If "precomputed", a distance matrix is needed as input for
    the fit method.

    .. versionadded:: 1.2

memory : str or object with the joblib.Memory interface, default=None
    Used to cache the output of the computation of the tree.
    By default, no caching is done. If a string is given, it is the
    path to the caching directory.

connectivity : array-like or callable, default=None
    Connectivity matrix. Defines for each sample the neighboring
    samples following a given structure of the data.
    This can be a connectivity matrix itself or a callable that transforms
    the data into a connectivity matrix, such as derived from
    `kneighbors_graph`. Default is ``None``, i.e, the
    hierarchical clustering algorithm is unstructured.

compute_full_tree : 'auto' or bool, default='auto'
    Stop early the construction of the tree at ``n_clusters``. This is
    useful to decrease computation time if the number of clusters is not
    small compared to the number of samples. This option is useful only
    when specifying a connectivity matrix. Note also that when varying the
    number of clusters and using caching, it may be advantageous to compute
    the full tree. It must be ``True`` if ``distance_threshold`` is not
    ``None``. By default `compute_full_tree` is "auto", which is equivalent
    to `True` when `distance_threshold` is not `None` or that `n_clusters`
    is inferior to the maximum between 100 or `0.02 * n_samples`.
    Otherwise, "auto" is equivalent to `False`.

linkage : {'ward', 'complete', 'average', 'single'}, default='ward'
    Which linkage criterion to use. The linkage criterion determines which
    distance to use between sets of observation. The algorithm will merge
    the pairs of cluster that minimize this criterion.

    - 'ward' minimizes the variance of the clusters being merged.
    - 'average' uses the average of the distances of each observation of
      the two sets.
    - 'complete' or 'maximum' linkage uses the maximum distances between
      all observations of the two sets.
    - 'single' uses the minimum of the distances between all observations
      of the two sets.

    .. versionadded:: 0.20
        Added the 'single' option

distance_threshold : float, default=None
    The linkage distance threshold at or above which clusters will not be
    merged. If not ``None``, ``n_clusters`` must be ``None`` and
    ``compute_full_tree`` must be ``True``.

    .. versionadded:: 0.21

compute_distances : bool, default=False
    Computes distances between clusters even if `distance_threshold` is not
    used. This can be used to make dendrogram visualization, but introduces
    a computational and memory overhead.

    .. versionadded:: 0.24

Attributes
----------
n_clusters_ : int
    The number of clusters found by the algorithm. If
    ``distance_threshold=None``, it will be equal to the given
    ``n_clusters``.

labels_ : ndarray of shape (n_samples)
    Cluster labels for each point.

n_leaves_ : int
    Number of leaves in the hierarchical tree.

n_connected_components_ : int
    The estimated number of connected components in the graph.

    .. versionadded:: 0.21
        ``n_connected_components_`` was added to replace ``n_components_``.

n_features_in_ : int
    Number of features seen during :term:`fit`.

    .. versionadded:: 0.24

feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X`
    has feature names that are all strings.

    .. versionadded:: 1.0

children_ : array-like of shape (n_samples-1, 2)
    The children of each non-leaf node. Values less than `n_samples`
    correspond to leaves of the tree which are the original samples.
    A node `i` greater than or equal to `n_samples` is a non-leaf
    node and has children `children_[i - n_samples]`. Alternatively
    at the i-th iteration, children[i][0] and children[i][1]
    are merged to form node `n_samples + i`.

distances_ : array-like of shape (n_nodes-1,)
    Distances between nodes in the corresponding place in `children_`.
    Only computed if `distance_threshold` is used or `compute_distances`
    is set to `True`.

See Also
--------
FeatureAgglomeration : Agglomerative clustering but for features instead of
    samples.
ward_tree : Hierarchical clustering with ward linkage.

Examples
--------
>>> from sklearn.cluster import AgglomerativeClustering
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clustering = AgglomerativeClustering().fit(X)
>>> clustering
AgglomerativeClustering()
>>> clustering.labels_
array([1, 1, 1, 0, 0, 0])
File:           c:\users\jesse\anaconda3\lib\site-packages\sklearn\cluster\_agglomerative.py
Type:           type
Subclasses:     FeatureAgglomeration
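
To make the linkage options in the documentation above a bit more concrete, here is a minimal sketch that computes the 'single', 'complete' and 'average' criteria by hand for two tiny clusters. The arrays `cluster_a` and `cluster_b` are made-up illustrative data, not part of our dataset, and 'ward' is left out because it is based on variance rather than on pairwise distances alone.

```python
import numpy as np

# Two small hypothetical clusters of 2-D points (illustrative data only).
cluster_a = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_b = np.array([[3.0, 0.0], [4.0, 1.0]])

# All pairwise Euclidean distances between a point in A and a point in B.
diffs = cluster_a[:, None, :] - cluster_b[None, :, :]
pairwise = np.sqrt((diffs ** 2).sum(axis=-1))

single_link = pairwise.min()     # 'single': distance of the closest pair
complete_link = pairwise.max()   # 'complete': distance of the farthest pair
average_link = pairwise.mean()   # 'average': mean over all pairs

print(single_link, complete_link, average_link)
```

Whatever the data, single linkage can never exceed average linkage, which in turn can never exceed complete linkage, so the choice changes which pair of clusters looks "closest" at each merge.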

We have already done some pre-processing, but to keep things together for this practice, let's repeat it here! We will be using the same "X" as in K-Means: HP, Attack, Defense, Special Attack, Special Defense, and Speed.

In [ ]:
X = df[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
c:\Users\jesse\Desktop\project\test.ipynb Cell 95 line 1
----> 1 X = df[['HP','Attack','Defense','Sp. Atk','Sp. Def','Speed']]

File c:\Users\jesse\anaconda3\Lib\site-packages\pandas\core\frame.py:3813, in DataFrame.__getitem__(self, key)
   3811     if is_iterator(key):
   3812         key = list(key)
-> 3813     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3815 # take() does not accept boolean indexers
   3816 if getattr(indexer, "dtype", None) == bool:

File c:\Users\jesse\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:6070, in Index._get_indexer_strict(self, key, axis_name)
   6067 else:
   6068     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6070 self._raise_if_missing(keyarr, indexer, axis_name)
   6072 keyarr = self.take(indexer)
   6073 if isinstance(key, Index):
   6074     # GH 42790 - Preserve name from an Index

File c:\Users\jesse\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:6130, in Index._raise_if_missing(self, key, indexer, axis_name)
   6128     if use_interval_msg:
   6129         key = list(key)
-> 6130     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6132 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
   6133 raise KeyError(f"{not_found} not in index")

KeyError: "None of [Index(['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed'], dtype='object')] are in the [columns]"
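
The KeyError above says that none of the requested columns exist in `df`, which suggests `df` is not holding the Pokemon data at this point. Before selecting features it can help to check which column names are actually there. The sketch below uses a hypothetical stand-in frame called `demo_df` (the real columns of `df` are not shown here), so treat it as an illustration of the check rather than a fix.

```python
import pandas as pd

# Hypothetical stand-in for df; the real frame's columns are not shown here.
demo_df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'HP': [45, 60, 80],
    'Attack': [49, 62, 82],
})

wanted = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# Split the requested names into those present and those missing.
present = [col for col in wanted if col in demo_df.columns]
missing = [col for col in wanted if col not in demo_df.columns]

print('missing:', missing)

# Selecting only the columns that exist cannot raise a KeyError.
X_demo = demo_df[present]
```

If `missing` is not empty, the usual next step is to go back and load or rebuild the right dataframe rather than carry on with fewer features.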

Let's figure out how many clusters are optimal for this model. Agglomerative Clustering uses a dendrogram to determine this number!

In [ ]:
#Create and display a dendrogram
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))  
plt.title('Dendrogram')
plt.xlabel('Pokemon')
plt.ylabel('Euclidean distances')
plt.axhline(y=825, color='r', linestyle='--')
plt.axhline(y=1575, color='r', linestyle='--')
dend = shc.dendrogram(shc.linkage(X, method='ward'))

To read a dendrogram for the optimal number of clusters, find the tallest vertical section that no merge (horizontal bar) interrupts, i.e. the largest gap between successive merge heights. The number of vertical lines (in this example the blue lines) that a horizontal cut through that section intersects is the optimal number of clusters. Can you tell how many clusters are optimal?

The tallest such section on this graph lies just below the final merge at the very top, so the optimal number of clusters is 2.
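
The same reading can also be checked numerically by looking for the largest gap between successive merge heights in the linkage matrix. This is only a sketch on made-up data: `X_demo` is a hypothetical toy array with two obvious groups, and the "largest gap" rule is a heuristic, not a guarantee.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Hypothetical toy data: two well-separated groups of three points each.
X_demo = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)

# Column 2 of the linkage matrix holds the height of each merge,
# in the order the merges happen (non-decreasing for Ward linkage).
Z = shc.linkage(X_demo, method='ward')
heights = Z[:, 2]

# Gap between each merge and the next one up the tree.
gaps = np.diff(heights)
i = int(np.argmax(gaps))

# Cutting inside the largest gap keeps the first i + 1 merges,
# which leaves n_samples - (i + 1) clusters.
n_samples = X_demo.shape[0]
suggested_k = n_samples - (i + 1)
print(suggested_k)
```

For this toy data the largest gap sits right below the final merge, so the sketch suggests 2 clusters, the same kind of reading we made from the plot.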

After determining what the optimal number of clusters is, input it into the model implementation below!

In [ ]:
#Implement model
# `metric` replaces `affinity`, which was deprecated in scikit-learn 1.2
agglo = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='ward')

Now let's fit the model and predict cluster labels so we can visualize the clusters!

In [ ]:
y_agglo = agglo.fit_predict(X)
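
The documentation shown earlier notes that `affinity` was deprecated in version 1.2 in favor of `metric`. If the notebook has to run on more than one scikit-learn version, one option is to check which keyword the installed version accepts. This is a sketch under that assumption; `distance_kwarg`, `X_demo` and `agglo_demo` are names made up for the illustration.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Same toy array as in the documentation example shown earlier.
X_demo = np.array([[1, 2], [1, 4], [1, 0],
                   [4, 2], [4, 4], [4, 0]])

# Check which keyword the installed scikit-learn accepts.
params = inspect.signature(AgglomerativeClustering).parameters
distance_kwarg = 'metric' if 'metric' in params else 'affinity'

# Pass the distance setting under whichever name is supported.
agglo_demo = AgglomerativeClustering(
    n_clusters=2, linkage='ward', **{distance_kwarg: 'euclidean'}
)
labels_demo = agglo_demo.fit_predict(X_demo)
print(distance_kwarg, labels_demo)
```

Since "euclidean" is the default distance anyway, simply leaving the argument out is another way to avoid the warning.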

Now let's visualize! We will once again be using PCA to do so.

In [ ]:
sns.scatterplot(x = pca_df[0], y = pca_df[1], hue=y_agglo)
<Axes: xlabel='0', ylabel='1'>
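
As a reminder of where columns `0` and `1` come from: `pca_df` was built earlier in the notebook, and its construction is not repeated here. Assuming it was created roughly as below (scale, project to two components, wrap in a DataFrame), the integer column names are simply the defaults pandas assigns. The data and the names `X_demo` and `pca_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 8 samples, 6 numeric features.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(8, 6)),
                      columns=['f1', 'f2', 'f3', 'f4', 'f5', 'f6'])

# Standardize so no single feature dominates the components (an assumption;
# the notebook's own preprocessing is not shown in this section).
X_scaled = StandardScaler().fit_transform(X_demo)

# Project onto the first two principal components.
pca_demo = pd.DataFrame(PCA(n_components=2).fit_transform(X_scaled))

print(list(pca_demo.columns))  # default integer column labels
print(pca_demo.shape)
```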

Now let's look at the K-Means visual again to compare.

In [ ]:
sns.scatterplot(x = pca_df[0], y = pca_df[1], hue=y)
<Axes: xlabel='0', ylabel='1'>

Can you note any differences or similarities you may see?

The split in K-Means is much cleaner than in agglomerative clustering. The K-Means clusters barely overlap in the PCA projection, while the agglomerative clusters overlap at several points.
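
If you want a number to go with the visual comparison, one option is the adjusted Rand index (ARI), which measures how similar two clusterings are regardless of how the clusters are numbered. The label lists below are hypothetical stand-ins for `y` and `y_agglo`, so this is a sketch of the idea rather than a result for our data.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical label vectors standing in for y (K-Means) and y_agglo.
kmeans_demo = [0, 0, 0, 1, 1, 1]
agglo_same = [1, 1, 1, 0, 0, 0]      # same grouping, cluster ids swapped
agglo_shifted = [0, 0, 1, 1, 1, 1]   # one point moved to the other cluster

# ARI ignores how the clusters are numbered: 1.0 means identical groupings.
ari_same = adjusted_rand_score(kmeans_demo, agglo_same)
ari_shifted = adjusted_rand_score(kmeans_demo, agglo_shifted)
print(ari_same, ari_shifted)
```

A value close to 1 would back up the impression that the two models mostly agree, while a lower value means more points changed cluster.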

Let's also look again at some separate features. We will again be looking at Attack and Defense, just as we did with K-Means!

In [ ]:
sns.scatterplot(x = df['Attack'], y = df['Defense'], hue=y_agglo)
<Axes: xlabel='Attack', ylabel='Defense'>

Once again, we pull up the K-Means visual for quick comparison. Can you note any similarities or differences?

Agglomerative clustering produced a slightly tighter grouping of cluster 1, with a few high-Defense points classified as 0 instead of 1. However, it also assigned several more points in the <100 range to cluster 0. Overall, the shift looks relatively minor to the naked eye and just makes drawing a split a little bit more difficult.

In [ ]:
sns.scatterplot(x = df['Attack'], y = df['Defense'], hue=y)
<Axes: xlabel='Attack', ylabel='Defense'>
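
To count how many points actually moved between clusters instead of judging by eye, a cross-tabulation of the two label vectors can help. The sketch below uses short hypothetical label Series in place of `y` and `y_agglo`, and it assumes the cluster ids of the two models already line up (cluster ids are arbitrary, so check that first on real data).

```python
import pandas as pd

# Hypothetical labels standing in for y (K-Means) and y_agglo.
kmeans_labels = pd.Series([0, 0, 0, 1, 1, 1, 1, 0], name='K-Means')
agglo_labels = pd.Series([0, 0, 1, 1, 1, 1, 0, 0], name='Agglomerative')

# Rows: K-Means cluster, columns: agglomerative cluster, cells: point counts.
table = pd.crosstab(kmeans_labels, agglo_labels)
print(table)

# Points whose label differs between the two models
# (only meaningful if the cluster ids correspond).
changed = kmeans_labels != agglo_labels
n_changed = int(changed.sum())
print(n_changed)
```

The off-diagonal cells of the table are the points that were assigned differently, which is the shift described in the answer above.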

Let's make an interactive scatterplot again! Remember that the x- and y-axes are our PCA values (from dimensionality reduction). Below, we concatenate the dataframe with the PCA values so that we can visualize properly. hover_data lets us specify which columns to show when hovering over each point.

In [ ]:
y_a_df = pd.DataFrame(y_agglo, columns=['Cluster (Agglomerative)'])
new_a_df = pd.concat([df, y_a_df], axis=1)
In [ ]:
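# px is plotly.express; imported here in case it was not imported earlier
import plotly.express as px

fig = px.scatter(pd.concat([new_a_df, pca_df], axis = 1), 
                 x = 0, y = 1, color='Cluster (Agglomerative)', hover_data=['Name','Type 1','Type 2','Legendary','Total'])
fig.show()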
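
One thing to watch with `pd.concat(..., axis=1)` is that it lines rows up by index label, not by position. If the original dataframe lost rows during cleaning and kept its old index, the PCA columns can end up attached to the wrong rows or padded with NaN. The frames below are hypothetical and only illustrate the pitfall and one possible fix.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a frame that lost row 1 in an earlier filtering step,
# and PCA output with a fresh 0..n-1 index and integer column names.
df_demo = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                        'Attack': [49, 62, 82, 100]}).drop(index=1)
pca_demo = pd.DataFrame(np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]))

# axis=1 aligns on index labels: 0, 2, 3 versus 0, 1, 2 do not line up.
misaligned = pd.concat([df_demo, pca_demo], axis=1)

# Resetting the index first pairs the rows by position instead.
aligned = pd.concat([df_demo.reset_index(drop=True), pca_demo], axis=1)

print(misaligned.shape, aligned.shape)
```

If the combined frame has more rows than either input, or NaN values in the hover columns, a mismatched index is a likely cause.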